Linux Performance Tuning and Stabilization Tips: Yoshinori Matsunobu
Yoshinori Matsunobu
Lead of MySQL Professional Services APAC Sun Microsystems [email protected]
Table of contents
Memory and Swap space management
Synchronous I/O, Filesystem, and I/O scheduler
Useful commands and tools
- iostat, mpstat, oprofile, SystemTap, gdb
16-64GB RAM is now pretty common
*Hot application data* should be cached in memory
Minimizing hot application data size is important
- Use compact data types (SMALLINT instead of VARCHAR/BIGINT, TIMESTAMP instead of DATETIME, etc)
- Do not create unnecessary indexes
- Delete records or move them to archive tables, to keep hot tables smaller
Copyright 2010 Sun Microsystems inc The Worlds Most Popular Open Source Database 3
DBT-2 benchmark (write intensive)
20-25GB hot data (200 warehouses, running 1 hour)
Nehalem 2.93GHz x 8 cores, MySQL 5.5.2, 4 RAID1+0 HDDs
RAM size affects everything. Not only SELECT, but also INSERT/UPDATE/DELETE
- INSERT: Random reads/writes happen when inserting into indexes in random order
- UPDATE/DELETE: Random reads/writes happen when modifying records
[Figure: RAM layout under Buffered I/O (InnoDB Buffer Pool plus Filesystem Cache) and Direct I/O (InnoDB Buffer Pool)]
Direct I/O is important to fully utilize memory
innodb_flush_method=O_DIRECT
Alignment: the file I/O unit must be a multiple of 512 bytes
- O_DIRECT can't be used for InnoDB Log File, Binary Log File, MyISAM, PostgreSQL data files, etc
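The alignment rule can be illustrated with a small check. This is a minimal sketch, not part of the original slides: the function name `is_direct_io_aligned` is hypothetical, and 512 bytes is simply the unit quoted above (a given device or filesystem may require a larger one).

```python
SECTOR_SIZE = 512  # alignment unit quoted on the slide; an assumption for other devices


def is_direct_io_aligned(offset: int, length: int, unit: int = SECTOR_SIZE) -> bool:
    """Return True when both the file offset and the I/O length are multiples of `unit`."""
    return offset % unit == 0 and length % unit == 0


# A 16KB page written at a 16KB boundary satisfies the rule.
print(is_direct_io_aligned(offset=16384 * 3, length=16384))  # True
# A 100-byte write at an arbitrary position does not.
print(is_direct_io_aligned(offset=1024, length=100))  # False
```

Files that are written in small, variable-sized units fail this check, which is one way to read the restriction listed above.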
Swap: 35650896k total, 4749460k used, 30901436k free, 819840k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5231 mysql
Swap is bad
- Process spaces are written to disk (swap out)
- Disk reads happen when accessing on-disk process spaces (swap in)
- Massive random disk reads and writes will happen
When neither RAM nor swap space is available, the OOM killer is invoked. The OOM killer may kill any process to free memory
The most memory-consuming process (mysqld) will be killed first
- It is an abort shutdown. Crash recovery takes place on restart
Priority is determined by ORDER BY /proc/<PID>/oom_score DESC
- Normally mysqld has the highest score
- The score depends on VM size, CPU time, running time, etc
It often takes a very long time (minutes to hours) for the OOM killer to kill processes
- We can't do anything until enough memory space is available
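The "ORDER BY /proc/<PID>/oom_score DESC" ranking can be inspected directly. The following is a rough sketch under some assumptions: the helper name `rank_by_oom_score` is made up for illustration, and it reads the process name from /proc/<PID>/comm, which exists on newer kernels but may be missing on older ones (such processes are simply skipped here).

```python
import os


def rank_by_oom_score(proc_root="/proc"):
    """Return [(score, pid, name)] sorted by oom_score, highest first."""
    ranked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited, or the file is missing/unreadable
        ranked.append((score, int(entry), name))
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked


if __name__ == "__main__" and os.path.isdir("/proc"):
    # Show the five most likely OOM killer victims on this machine.
    for score, pid, name in rank_by_oom_score()[:5]:
        print(score, pid, name)
```

On a dedicated database server one would expect mysqld at the top of this list, as the slide states.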
If no memory space is available, the OOM killer will be invoked
Some CPU cores consume 100% system resources
- 24.9% (average) = 1 of 4 cores uses 100% CPU in this case
- The terminal froze (SSH connections can't be established)
We want to keep mysqld in RAM, rather than allocating large filesystem cache
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5231 mysql 25 0 27.0g 27g 288 S 0.0 92.6 7:40.88 mysqld

Copying 8GB datafile

Mem: ... 10240k buffers
Swap: 35650896k total, 4749460k used, 30901436k free, 8819840k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5231 mysql

vm.swappiness = 0

Mem: 32967008k total, 28947472k used, 4019536k free, 152520k buffers
Swap: 35650896k total, 0k used, 35650896k free, 197824k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5231 mysql 25 0 27.0g 27g 288 S 0.0 91.3 7:55.88 mysqld

Copying 8GB of datafile

Mem: 32967008k total, 32783668k used, 183340k free, 3940k buffers
Swap: 35650896k total, 216k used, 35650680k free, 4117432k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5231 mysql
When physical RAM is fully consumed, the Linux kernel reduces the filesystem cache with high priority (lower swappiness increases the priority)
After no filesystem cache is available, swapping starts
- The OOM killer won't be invoked if large enough swap space is allocated. It's safer
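To watch whether swapping has started, the swap figures that top prints can also be derived from /proc/meminfo. Below is a small sketch; the function names are hypothetical, and the sample text is built from the "after copy" figures shown in the top output above rather than captured from a real /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields):
    """Swap in use = SwapTotal - SwapFree (both in kB)."""
    return fields["SwapTotal"] - fields["SwapFree"]


# Sample values taken from the vm.swappiness = 0 case (illustrative only).
sample = """MemTotal: 32967008 kB
Cached: 4117432 kB
SwapTotal: 35650896 kB
SwapFree: 35650680 kB"""

print(swap_used_kb(parse_meminfo(sample)))  # 216
```

On a live system the same functions could be fed the contents of /proc/meminfo; a steadily growing result would suggest that process memory is being swapped out.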
Memory allocator
mysqld uses malloc()/mmap() for memory allocation
A faster and more concurrent memory allocator such as tcmalloc can be used
Install Google Perftools (tcmalloc is included)
# yum install libunwind
# cd google-perftools-1.5 ; ./configure --enable-frame-pointers; make; make install
DBT-2 benchmark (write intensive)
Nehalem 2.93GHz x 8 cores, MySQL 5.5.2
20-25GB hot data (200 warehouses, running 1 hour)
In some cases, too high per-session memory allocation has a negative performance impact
SELECT * FROM huge_myisam_table LIMIT 1;
SET read_buffer_size = 256*1024; (256KB) -> 0.68 seconds to run 10,000 times
SET read_buffer_size = 2048*1024; (2MB) -> 18.81 seconds to run 10,000 times
In many cases MySQL does not allocate more per-session memory than needed. But be careful about some extreme cases (like above: MyISAM+LIMIT+FullScan)
RDBMS calls fsync() many times (per transaction commit, checkpoints, etc)
Make sure to use Battery Backed up Write Cache (BBWC) on RAID cards
- 10,000+ fsync() per second with BBWC; less than 200 without BBWC on HDD
Disable write cache on disks for safety reasons
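The fsync() rate of a given volume can be estimated with a short loop. This is only a sketch: `measure_fsync_rate` is a hypothetical helper, it overwrites a single 512-byte block in a temporary file, and the number it reports depends on the filesystem, controller cache, and drive settings, so it should not be compared directly with the figures above.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory=None, seconds=1.0, block=b"x" * 512):
    """Overwrite one block and fsync() repeatedly; return fsync calls per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        calls = 0
        start = time.monotonic()
        while time.monotonic() - start < seconds:
            os.pwrite(fd, block, 0)  # overwrite at offset 0, as a redo log would
            os.fsync(fd)             # force the write down to the storage layer
            calls += 1
        elapsed = time.monotonic() - start
        return calls / elapsed
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(f"{measure_fsync_rate(seconds=0.5):.0f} fsync()/sec")
```

Passing `directory` lets the test run on the volume that actually holds the database log files instead of the default temporary directory.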
Overwriting or Appending?
Some files are overwritten (fixed file size), others are appended (increasing file size)
- Overwritten: InnoDB Logfile
- Appended: Binary Logfile
Filesystem ext3
By far the most widely used filesystem
But not always the best
Deleting large files takes a long time
- Internally has to do a lot of random disk I/O (slow on HDD)
- In MySQL, if DROP TABLE takes a long time, all client threads are blocked from opening/closing tables (by the LOCK_open mutex)
- Be careful when using MyISAM, InnoDB with innodb_file_per_table, PBXT, etc
Filesystem xfs/ext2
xfs
- Fast for dropping files
- Concurrent writes to a file are possible when using O_DIRECT
- Not officially supported in current RHEL (supported in SuSE)
- Disable write barrier by setting "nobarrier"
ext2
- Faster for writes because ext2 doesn't support journaling
- fsck takes a very long time
- In active-active redundancy environments (e.g. MySQL Replication), ext2 is used in some cases to gain performance
[Figure: chart with series HDD(ext3), HDD(xfs), Intel(ext3), Intel(xfs)]
I/O scheduler
Note: RDBMS (especially InnoDB) also schedules I/O requests, so theoretically the Linux I/O scheduler is not needed
Linux has I/O schedulers to efficiently handle lots of I/O requests
I/O scheduler type and queue size matter
cfq madness
Running two benchmark programs concurrently
1. Multi-threaded random disk reads (simulating RDBMS reads)
2. Single-threaded overwriting + fsync() (simulating redo log writes)

Random read I/O threads   write+fsync() running   Scheduler       reads/sec from iostat
1                         No                      noop/deadline   260
1                         No                      cfq             260
100                       No                      noop/deadline   2100
100                       No                      cfq             2100
1                         Yes                     noop/deadline   212
1                         Yes                     cfq             248
100                       Yes                     noop/deadline   1915
100                       Yes                     cfq             2084

In RDBMS, write IOPS is often very high because HDD + write cache can handle thousands of transaction commits per second (write+fsync)
Write IOPS was adjusted to per-thread read IOPS in cfq, which reduced total IOPS significantly
Verified on RHEL5.3 and SuSE 11, Sun Fire X4150, 4 HDD H/W RAID1+0
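Before changing schedulers it helps to see which one a device is currently using. The kernel marks the active scheduler with square brackets in /sys/block/<device>/queue/scheduler. The sketch below parses that format; the function names are hypothetical and the sample string is illustrative, not output captured from the benchmark machine.

```python
def active_scheduler(text):
    """Return the bracketed (active) scheduler from a queue/scheduler line."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device, sys_root="/sys/block"):
    """Read the active scheduler for a block device such as 'sda'."""
    with open(f"{sys_root}/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())


# Example line in the format the kernel uses:
print(active_scheduler("noop anticipatory deadline [cfq]"))  # cfq
```

A monitoring job could call `read_scheduler` for each data volume and warn when the result is cfq.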
[Figure: RAID1+0 vs RAID5]
- Sun Fire X4150 (4 HDDs, H/W RAID controller+BBWC)
- RHEL5.3 (2.6.18-128)
- Built-in InnoDB 5.1
Queue size = N
Sorting N outstanding I/O requests to optimize disk seeks
MyISAM does not optimize I/O requests internally
- Highly dependent on OS and storage
When inserting into indexes, massive random disk writes/reads happen
Increasing the I/O queue size reduces disk seek overhead
# echo 100000 > /sys/block/sdX/queue/nr_requests
No impact in InnoDB
Many RDBMS including InnoDB internally sort I/O requests
iostat
Detailed I/O statistics per device
Very important tool because in most cases RDBMS becomes I/O bound
iostat -x
Check r/s, w/s, svctm, %util
- IOPS is much more important than transfer size
Always %util = (r/s + w/s) * svctm (hard coded in the iostat source file)
- svctm is in milliseconds: in the output below, (283.12 + 47.35) * 3.02 ms is about 998 ms busy per second, i.e. 99.8%

# iostat -xm 10
avg-cpu: %user %nice %system %iowait %steal %idle
         21.16  0.00    6.14   29.77   0.00 42.93
Device: rrqm/s wrqm/s    r/s   w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb       2.60 389.01 283.12 47.35  4.86  2.19    43.67     4.89 14.76  3.02 99.83
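The %util relationship can be checked against the sample output with a few lines of arithmetic. This is a sketch with a hypothetical function name; it only assumes that svctm is reported in milliseconds, which is why the product is divided by 1000 before being scaled to a percentage.

```python
def util_percent(reads_per_sec, writes_per_sec, svctm_ms):
    """%util = IOPS * service time; svctm is in ms, so convert to seconds, then to percent."""
    iops = reads_per_sec + writes_per_sec
    return iops * svctm_ms / 1000.0 * 100.0


# Values from the sdb line above: r/s=283.12, w/s=47.35, svctm=3.02
print(round(util_percent(283.12, 47.35, 3.02), 1))  # 99.8
```

The result agrees with the reported 99.83 up to rounding of the displayed columns.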
In both cases %util is almost 100%, but r/s and w/s are far different
Do not trust %util too much
Check svctm rather than %util
- If your storage can handle 1000 IOPS, svctm should be less than 1.00 (ms), so you can send alerts if svctm stays higher than 1.00 for a couple of minutes
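The alerting idea can be sketched as a check over recent svctm samples. Everything here is an assumption for illustration: the function name, the default of 12 consecutive samples (two minutes at a 10-second iostat interval), and the rule that the threshold is 1000 ms divided by the IOPS the storage is expected to sustain.

```python
def svctm_alert(samples_ms, storage_iops=1000, sustained=12):
    """Return True if the last `sustained` svctm samples all exceed 1000/storage_iops ms."""
    threshold_ms = 1000.0 / storage_iops  # 1000 IOPS -> 1.00 ms per request
    if len(samples_ms) < sustained:
        return False  # not enough history yet
    return all(s > threshold_ms for s in samples_ms[-sustained:])


# Twelve consecutive samples at 3.02 ms: alert.
print(svctm_alert([3.02] * 12))  # True
# One healthy sample inside the window: no alert.
print(svctm_alert([3.02] * 11 + [0.8]))  # False
```

Requiring a sustained run avoids alerting on a single slow interval, which matches the "for a couple of minutes" condition above.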
mpstat
Per CPU core statistics
vmstat displays average statistics
It's very common that only one of the CPU cores consumes 100% CPU resources
- The rest of the CPU cores are idle
- Especially applies to batch jobs
If you check only vmstat/top/iostat/sar you will not notice a single-threaded bottleneck
You can also check network bottlenecks (%irq, %soft) from mpstat
- vmstat counts them as %idle
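The difference between an average and a per-core view can be shown with two snapshots of /proc/stat, which is the kind of per-CPU counter data that per-core tools report. This is a simplified sketch with hypothetical function names and synthetic snapshot text; it treats idle plus iowait as "not busy" and uses only the first eight counters of each cpuN line.

```python
def parse_cpu_lines(stat_text):
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat text."""
    result = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue  # skip the aggregate "cpu" line and non-CPU lines
        values = [int(v) for v in parts[1:9]]  # user nice system idle iowait irq softirq steal
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        result[parts[0]] = (total - idle, total)
    return result


def per_core_busy(before, after):
    """Busy percentage per core between two /proc/stat snapshots."""
    a, b = parse_cpu_lines(before), parse_cpu_lines(after)
    busy = {}
    for cpu in b:
        d_total = b[cpu][1] - a[cpu][1]
        busy[cpu] = 100.0 * (b[cpu][0] - a[cpu][0]) / d_total if d_total else 0.0
    return busy


# Synthetic snapshots: cpu0 is fully busy, cpu1 is fully idle.
before = "cpu0 100 0 0 900 0 0 0 0\ncpu1 0 0 0 1000 0 0 0 0"
after = "cpu0 200 0 0 900 0 0 0 0\ncpu1 0 0 0 1100 0 0 0 0"
print(per_core_busy(before, after))  # {'cpu0': 100.0, 'cpu1': 0.0}
```

Averaged over both cores this machine looks 50% idle, which is exactly how a single-threaded bottleneck hides in an averaged view.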
Oprofile
Profiling CPU usage from running processes
You can easily identify which functions consume CPU resources
Supports both user space and system space profiling
Mainly used by database-internal developers
If specific functions consume most of the resources, applications might be re-designed to skip calling them
Not useful to check low-CPU activities
- I/O bound, mutex waits, etc
How to use
- opcontrol --start (--no-vmlinux)
- benchmarking
- opcontrol --dump
- opcontrol --shutdown
- opreport -l /usr/local/bin/mysqld
Oprofile example
# opreport -l /usr/local/bin/mysqld
samples  %       symbol name
83003    8.8858  String::copy(char const*, unsigned int, charset_info_st*, charset_info_st*, unsigned int*)
79125    8.4706  MYSQLparse(void*)
68253    7.3067  my_wc_mb_latin1
55410    5.9318  my_pthread_fastmutex_lock
34677    3.7123  my_utf8_uni
18359    1.9654  MYSQLlex(void*, void*)
12044    1.2894  _ZL15get_hash_symbolPKcjb
11425    1.2231  _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array

You can see that quite a lot of CPU resources were spent on character conversions (latin1 <-> utf8)
Disabling character code conversions on the application side will improve performance (20% in this case)

samples  %        symbol name
83107    10.6202  MYSQLparse(void*)
68680    8.7765   my_pthread_fastmutex_lock
20469    2.6157   MYSQLlex(void*, void*)
13083    1.6719   _ZL15get_hash_symbolPKcjb
12148    1.5524   JOIN::optimize()
11529    1.4733   _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array
SystemTap
SystemTap provides a simple command line interface and scripting language for writing instrumentation for a live running kernel (and applications)
Similar to DTrace
A SystemTap script runs as a Linux kernel module
No need to rebuild applications to profile
- kernel-headers/devel and kernel-debuginfo packages are needed
Supported in RHEL 5 by default (5.4 is more stable than older versions, but be careful about bug reports)
User level functions can be profiled if the target program has DWARF debugging symbols
- The official MySQL binary has DWARF symbols, so you do not need to rebuild mysqld
- Add -g if you build MySQL yourself
You can write custom C code inside a SystemTap script, but it's limited
- This is called guru mode
- Easily causes kernel panic. Be extremely careful
- Since it is a kernel module, user-side libraries cannot be used
iostat provides per-device I/O statistics, iotop provides per-process I/O statistics
- Not enough for mysqld
Sample Code
probe syscall.read, syscall.pread {
  if (execname() == "mysqld") {
    readstats[pid(), fd] <<< count
    fdlist[pid(), fd] = fd
  }
}
probe timer.s($1) {
  foreach ([pid+, fd] in fdlist) {
    reads = @count(readstats[pid, fd])
    rbytes = @sum(rdbs[pid, fd])
    printf("%d %d %d\n", fd, reads, rbytes)
    ...

#!/bin/sh
stap filestat.stap 10 | perl sum.pl
Old algorithm
1) Load into sort buffer
   rating  RowID
   4.71    1
   3.32    2
   4.10    3
   4.50    4
2) Sort
   rating  RowID
   4.71    1
   4.50    4
   4.10    3
   3.32    2
   user_id: 100, 2, 3, 10

   rating  title
   4.71    UEFA CL: Inter vs Chelsea
   3.32    Denmark vs Japan, 3-0
   4.10    MySQL Administration
   4.50    Linux tuning
2) Sort

SystemTap Script 2
global oldsort=0;
global newsort=0;
probe process("/usr/local/bin/mysqld").function("*rr_from_pointers*").return {
  oldsort++;
}
probe process("/usr/local/bin/mysqld").function("*rr_unpack_from_buffer*").return {
  newsort++;
}
probe end {
  printf("# of returned rows sorted by old algorithm: %d \n", oldsort);
  printf("# of returned rows sorted by new algorithm: %d \n", newsort);
}
----
[root #] stap sort.stp
# of returned rows sorted by old algorithm: 0
# of returned rows sorted by new algorithm: 100
gdb
Debugging tool
gdb can take thread stack dumps from a running process (similar to Solaris pstack)
Useful to identify where and why mysqld hangs up, slows down, etc
- But you have to read MySQL source code
Suddenly all queries stopped responding for 1-10 seconds
Checking slow query log
- All queries are simple enough; it's strange for them to take 10 seconds
CPU util (%us, %sy) was almost zero
SHOW GLOBAL STATUS and SHOW FULL PROCESSLIST were not helpful
# Dump stack traces of all mysqld threads with gdb
gdbtrace() {
  PID=`cat /var/lib/mysql/mysql.pid`
  STACKDUMP=/tmp/stackdump.$$
  echo 'thread apply all bt' > $STACKDUMP
  echo 'detach' >> $STACKDUMP
  echo 'quit' >> $STACKDUMP
  gdb --batch --pid=$PID -x $STACKDUMP
}

# Every 3 seconds, take a dump if more than 100 connections are established
while true
do
  CONN=`netstat -an | grep 3306 | grep ESTABLISHED | wc | awk '{print $1}'`
  if [ $CONN -gt 100 ]; then
    gdbtrace
  fi
  sleep 3
done
Stack Trace
.....
Thread 73 (Thread 0x46c1d950 (LWP 28494)):
#0 0x00007ffda5474384 in __lll_lock_wait () from /lib/libpthread.so.0
#1 0x00007ffda546fc5c in _L_lock_1054 () from /lib/libpthread.so.0
#2 0x00007ffda546fb30 in pthread_mutex_lock () from /lib/libpthread.so.0
#3 0x0000000000a0f67d in my_pthread_fastmutex_lock (mp=0xf46d30) at thr_mutex.c:487
#4 0x000000000060cbe4 in dispatch_command (command=16018736, thd=0x80, packet=0x65 <Address 0x65 out of bounds>, packet_length=4294967295) at sql_parse.cc:969
#5 0x000000000060cb56 in do_command (thd=0xf46d30) at sql_parse.cc:854
#6 0x0000000000607f0c in handle_one_connection (arg=0xf46d30) at sql_connect.cc:1127
#7 0x00007ffda546dfc7 in start_thread () from /lib/libpthread.so.0
#8 0x00007ffda46305ad in clone () from /lib/libc.so.6
#9 0x0000000000000000 in ?? ()

Many threads were waiting at pthread_mutex_lock(), called from sql_parse.cc:969
Reading sql_parse.cc:969
953 bool dispatch_command(enum enum_server_command command, THD *thd,
954                       char* packet, uint packet_length)
955 {
956   NET *net= &thd->net;
957   bool error= 0;
958   DBUG_ENTER("dispatch_command");
959   DBUG_PRINT("info",("packet: '%*.s'; command: %d", packet_length, packet, command));
960
961   thd->command=command;
962   /*
963     Commands which always take a long time are logged into
964     the slow log only if opt_log_slow_admin_statements is set.
965   */
966   thd->enable_slow_log= TRUE;
967   thd->lex->sql_command= SQLCOM_END; /* to avoid confusing VIEW detectors */
968   thd->set_time();
969   VOID(pthread_mutex_lock(&LOCK_thread_count));
Reading mysqld.cc:4811
4795 void create_thread_to_handle_connection(THD *thd)
4796 {
4797   if (cached_thread_count > wake_thread)
4798   {
4799     /* Get thread from cache */
4800     thread_cache.append(thd);
4801     wake_thread++;
4802     pthread_cond_signal(&COND_thread_cache);
4803   }
4804   else
4805   {
4811     if ((error=pthread_create(&thd->real_id,&connection_attrib,
4812                               handle_one_connection,
4813                               (void*) thd)))
4839   }
4840   (void) pthread_mutex_unlock(&LOCK_thread_count);

pthread_create is called inside the critical section (LOCK_thread_count is released after that)
If cached_thread_count > wake_thread, pthread_create is not called
Increasing thread_cache_size will fix the problem!
Configuration Summary
Allocate swap space (approx half of RAM size)
Set vm.swappiness = 0 and use O_DIRECT
Set /sys/block/sdX/queue/scheduler = deadline or noop
Filesystem Tuning
- relatime (noatime)
- ext3: tune2fs -O dir_index -c -1 -i 0
- xfs: nobarrier
- Make sure write cache with battery is enabled
Others
Make sure to allocate database partitions (/var/lib/mysql, /tmp) separately from the root partition (/)
- When the database partition becomes full, it should not affect the Linux kernel
/etc/security/limits.conf
- soft nofile 8192
- hard nofile 8192
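Whether the nofile setting actually took effect for a process can be checked from inside that process. The sketch below uses Python's standard `resource` module; the helper names are hypothetical, and 8192 is simply the value from the limits.conf lines above. Note that it reports the limits of the process running the script, so it would need to run under the same user and login path as the database server to be meaningful.

```python
import resource


def nofile_limits():
    """Return (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_minimum(required=8192):
    """True if the soft open-file limit is unlimited or at least `required`."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} ok={meets_minimum()}")
```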
Contact:
E-mail: [email protected]
Blog: http://yoshinorimatsunobu.blogspot.com
@matsunobu on Twitter