Your Disk Array Is Slower Than It Should Be
Paul Tuckfield
I'm talking OLTP, and just IO
• You need to make sure you've done other things first, minimized your IO load, and arrived at the point where you must improve IO performance.
• If you bought a multidisk controller, make sure it's actually using multiple disks concurrently.
4 ways to make OLTP IO faster
0.) Do not do IO for the query.
1.) Do IO only for the rows the query needs.
2.) If you must do lots of IO, do it sequentially (read ahead), but only in the DB, not in the filesystem or RAID.
3.) Make sure you're really doing concurrent IO to multiple spindles (many things can prevent this).
Digression: sometimes it's faster to stream a whole table from disk and ignore most of the blocks, even though an index might have fetched only the blocks you want. Where is the tradeoff point? In OLTP, almost never, because it flushes caches. (A back-of-the-envelope sketch follows below.)
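For the throughput side of that tradeoff only (ignoring the cache-flushing cost just mentioned), here is a back-of-the-envelope sketch in Python; the per-spindle numbers are invented, era-typical assumptions, not measurements:

    # Rough break-even point between random index reads and streaming the table,
    # using assumed single-spindle figures (not measured).
    RANDOM_IOPS = 200          # ~5 ms per random seek + 16k read
    BLOCK_KB = 16              # InnoDB page size
    SEQ_MB_PER_S = 70          # sequential streaming throughput

    random_mb_per_s = RANDOM_IOPS * BLOCK_KB / 1024      # useful pages per second
    breakeven = random_mb_per_s / SEQ_MB_PER_S           # fraction of table needed
    print(f"random reads deliver ~{random_mb_per_s:.1f} MB/s of useful pages")
    print(f"a full scan wins once you need more than ~{breakeven:.0%} of the blocks")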
4 ways to make OLTP IO faster
DB cache:   3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   7 | 8 | 9 | 10 | 11
RAID cache: 10 | 11
Disk:       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 | 13 | 14 | 15
Only DB should cache reads
Suppose a read of any block: if it's in any of these caches, it will also be in the DB cache, because the DB cache is the largest. Therefore the read caches of the filesystem and RAID controller will never be used usefully on a cache hit. Suppose it's an InnoDB cache miss, on the other hand: it won't be in the FS or RAID cache either, since they're smaller and can't have a longer LRU list than InnoDB's. Therefore, they're useless. (A small simulation of this follows below.)
DB cache:   3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   7 | 8 | 9 | 10 | 11
RAID cache: 10 | 11
Disk:       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 | 13 | 14 | 15
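To make the layering argument concrete, here is a minimal pure-Python sketch; the cache sizes and the workload are invented for illustration. A larger "DB" LRU sits in front of a smaller "FS" LRU that only ever sees the DB's misses, and the smaller cache ends up with essentially no hits:

    import random
    from collections import OrderedDict

    class LRU:
        """Tiny LRU cache that only tracks which block numbers it holds."""
        def __init__(self, size):
            self.size, self.blocks = size, OrderedDict()
            self.hits = 0
        def access(self, blk):
            if blk in self.blocks:
                self.blocks.move_to_end(blk)        # refresh on hit
                self.hits += 1
                return True
            self.blocks[blk] = True                 # insert on miss
            if len(self.blocks) > self.size:
                self.blocks.popitem(last=False)     # evict the coldest block
            return False

    # Assumed sizes: the DB cache is the biggest layer, the FS cache is smaller.
    db_cache, fs_cache = LRU(1000), LRU(100)
    hot = range(500)                                # working set fits in the DB cache
    random.seed(1)
    for _ in range(200_000):
        blk = random.choice(hot) if random.random() < 0.9 else random.randrange(100_000)
        if not db_cache.access(blk):                # only DB misses reach the FS layer
            fs_cache.access(blk)

    print("DB cache hits:", db_cache.hits)          # serves nearly all the hot traffic
    print("FS cache hits:", fs_cache.hits)          # approximately zero: wasted RAM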
Only the DB should readahead
• The DB knows, or at least has far better heuristic data to know, when it needs to read ahead (InnoDB's method is questionable IMO).
• The filesystem and RAID are making much weaker guesses; if they do readahead, it's more or less hard coded, which is not good. The nominal case should be a single block.
• When they do readahead, they're essentially doing even more read caching, which we've established is useless or even bad in these layers.
• Check “avgrq-sz” in iostat to see if some upper layer is doing a > 1 block (16k in the case of InnoDB) read; see the sketch below.
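A quick way to apply the avgrq-sz check without eyeballing it: in sysstat's iostat -x output, avgrq-sz is the average request size in 512-byte sectors, so pure single-page InnoDB reads should sit near 32. A hedged Python sketch (the device names, sample values, and the 1.5x slack factor are all made up):

    SECTOR_BYTES = 512
    INNODB_PAGE_BYTES = 16 * 1024          # one InnoDB page = 32 sectors

    def readahead_suspected(avgrq_sz_sectors, slack=1.5):
        """True if the average request is much bigger than one InnoDB page,
        which suggests the filesystem or RAID layer is reading ahead."""
        avg_bytes = avgrq_sz_sectors * SECTOR_BYTES
        return avg_bytes > slack * INNODB_PAGE_BYTES, avg_bytes

    # Invented samples: one device doing single-page reads, one doing big reads.
    for dev, avgrq in (("sda", 33.8), ("sdb", 256.0)):
        suspicious, avg_bytes = readahead_suspected(avgrq)
        verdict = "check readahead settings" if suspicious else "looks like single-page IO"
        print(f"{dev}: avg request {avg_bytes / 1024:.0f} KB ({verdict})")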
Only cache writes in controller
Suppose that after reading 0 through 11, the DB modifies blocks 10 and 11. The blocks are written to the FS cache, then to the RAID cache. Until acknowledgments come back, they're “dirty pages”, signified in red on the slide.
DB cache:   3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   7 | 8 | 9 | 10 | 11
RAID cache: 10 | 11
Disk:       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 | 13 | 14 | 15
Only cache writes in controller
When the RAID puts the blocks in RAM, it tells the FS that the block is written; likewise, the FS tells the DB it's written. Both layers then consider the page to be “clean”, signified by turning them black here.
DB cache:   3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   7 | 8 | 9 | 10 | 11
Disk:       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 | 13 | 14 | 15
Only cache writes in controller
The database can burst lots of writes to the RAID cache before they ever get committed to disk, which is good: the RAID offloads all the waiting and housekeeping of writing to persistent storage. Now suppose a cache miss occurs requesting block 14. It can't be put into the cache until block 10 is written and the LRU is “pushed down”.
Before:
DB cache:   3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   7 | 8 | 9 | 10 | 11
RAID cache: 10 | 11
After the LRU is pushed down:
DB cache:   4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 14
FS cache:   8 | 9 | 10 | 11 | 14
RAID cache: 11 |
Disk:       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 | 13 | 14 | 15
Only cache writes in controller
Now 14 can be put into the RAID cache, at a cost of one more IO, before being sent on to the host.
DB cache:   4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 14
FS cache:   8 | 9 | 10 | 11 | 14
RAID cache: 11 | 14
Disk:       0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | 12 | 13 | 14 | 15
This is the ideal I seek: no filesystem cache at all, and only caching writes on the RAID. In the same scenario, you can see how the read of block 2 will not serialize behind the write of block 10.
DB cache:   4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   (empty)
RAID cache: 10 | 11
Now, we've already established that the read cache is useless, because any cache hit will happen in the upper, much larger layers. If dirty pages are at the tail of an LRU cache, any read will also have to wait for a dirty page to be written before proceeding (to free a cache buffer in the RAID controller before the data is passed up to the host). Therefore, it's great to cache writes, as long as you do not cache reads in the controller.
Some controllers (EMC, for example, I think) may partition dirty and clean pages, and also partition by spindle, so that reads still don't serialize behind writes and no one spindle “borrows” too much bandwidth, leaving a cache full of dirty pages that must be retired to only one spindle, for example. But again, we've established that RAID read cache is useless in a nominal setup, because it's always smaller than the host-side cache, so you may as well just give all the space to writes.
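Getting to “no filesystem cache at all” usually means opening the data files with O_DIRECT; InnoDB exposes this as innodb_flush_method=O_DIRECT. As a minimal illustration of what that flag does (not of how InnoDB does it), here is a hedged, Linux-only Python sketch; the file name is made up, O_DIRECT needs an aligned buffer, and the file must live on a filesystem that supports direct IO (tmpfs, for instance, does not):

    import mmap, os

    BLOCK = 16 * 1024                    # one InnoDB-sized page, for illustration

    # O_DIRECT requires buffer/offset/length alignment; an anonymous mmap is
    # page-aligned, so it satisfies the alignment rules.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"\xab" * BLOCK)

    # Hypothetical demo file on a real (non-tmpfs) filesystem.
    fd = os.open("odirect_demo.dat", os.O_CREAT | os.O_RDWR | os.O_DIRECT, 0o600)
    os.write(fd, buf)                    # goes to the device, not the OS page cache
    os.lseek(fd, 0, os.SEEK_SET)
    os.readv(fd, [buf])                  # read back into the same aligned buffer
    os.close(fd)
    print("wrote and re-read", BLOCK, "bytes, bypassing the OS page cache")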
Only cache writes in controller
• Burst writes “borrow” bandwidth by writing to the RAID cache.
• At some point, new IOs cannot continue, because the controller must “pay back” the IO, freeing up cache slots.
• If reads don't go through this cache, they never serialize here. Some controllers partition, which is good enough.
• Various caching algorithms vary this somewhat, but in the end they are at best no better than not caching reads, and at worst hurt overall performance (a toy model of this follows below).
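A toy model of the serialization these bullets describe; everything here (slot count, workload mix, block ranges) is invented for illustration, and real controllers retire dirty pages in the background, so treat it only as a sketch of the mechanism. One run lets the controller cache reads, the other lets reads bypass the cache, and the counter shows how often a read had to sit behind a dirty-page flush:

    import random
    from collections import OrderedDict

    class ControllerCache:
        """Toy write-back controller cache. If reads are cached too, a read miss
        on a full cache must first retire the LRU block; when that block is
        dirty, the read serializes behind a write."""
        def __init__(self, slots, cache_reads):
            self.slots, self.cache_reads = slots, cache_reads
            self.blocks = OrderedDict()              # block number -> dirty flag
            self.reads_stalled = 0

        def _make_room(self, for_read):
            while len(self.blocks) >= self.slots:
                _, dirty = self.blocks.popitem(last=False)
                if dirty and for_read:
                    self.reads_stalled += 1          # the read waits for this flush

        def write(self, blk):
            if blk not in self.blocks:
                self._make_room(for_read=False)
            self.blocks[blk] = True                  # dirty until retired to disk
            self.blocks.move_to_end(blk)

        def read(self, blk):
            if not self.cache_reads:
                return                               # read bypasses the cache entirely
            if blk in self.blocks:
                self.blocks.move_to_end(blk)
                return
            self._make_room(for_read=True)
            self.blocks[blk] = False                 # a clean (and useless) read copy

    random.seed(2)
    for caches_reads in (True, False):
        cc = ControllerCache(slots=64, cache_reads=caches_reads)
        for _ in range(10_000):
            if random.random() < 0.7:
                cc.write(random.randrange(1_000))    # bursty writes to a hot area
            else:
                cc.read(random.randrange(1_000_000)) # mostly uncached reads
        print(f"controller caches reads={caches_reads}: "
              f"{cc.reads_stalled} reads stalled behind a dirty flush")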
Make stripe size >> blocksize
• Suppose you read a 16k block from a stripe (or RAID 10) with a small chunk size. You “engage” every spindle to service one IO (see the sketch after the diagram below).
DB cache:   0 | 1 | 2 | 6 | 7 | 8 | 9 | 10 | 11
FS cache:   (empty)
RAID cache: 10 | 11
(timeline: t0, t1)
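The arithmetic behind the bullet above, as a small sketch; the 4-disk layout, the chunk sizes, and the assumption of 16 KB-aligned page reads are illustrative choices, not a recommendation:

    # How many spindles does one 16 KB read engage, for a given stripe chunk size?
    def spindles_engaged(offset_bytes, io_bytes, chunk_bytes, n_disks):
        first_chunk = offset_bytes // chunk_bytes
        last_chunk = (offset_bytes + io_bytes - 1) // chunk_bytes
        # In a simple stripe, consecutive chunks live on consecutive spindles.
        return len({c % n_disks for c in range(first_chunk, last_chunk + 1)})

    IO = 16 * 1024                                   # one InnoDB page
    for chunk in (4 * 1024, 16 * 1024, 256 * 1024):
        offsets = range(0, 16 * chunk, IO)           # 16 KB-aligned read offsets
        avg = sum(spindles_engaged(off, IO, chunk, 4) for off in offsets) / len(offsets)
        print(f"chunk {chunk // 1024:>3} KB: {avg:.2f} spindles per 16 KB read")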
Concurrent vs serialized: both 80% busy
When you see a multidisk array at 80% busy, you have to make sure the IOs/sec and response times cross-check with the %busy figure.
Array 1: d1 d2 d3 d4, 80% busy, 9 IOs at 200 ms (concurrent)
Array 2: d1 d2 d3 d4, 80% busy, 4 IOs at 200 ms (serialized)
Formula for detecting serialization:
(r/s + w/s) * svctm < 1 with %busy near 100
• Remember that svctm is in milliseconds, so divide by 1000
  – Ex 1: (r/s + w/s) = 9, so
    • 9 * 0.200 = 1.8
    • yes, concurrent IOs on the array
  – Ex 2: (r/s + w/s) = 4, so
    • 4 * 0.200 = 0.8
    • NO, the array is not doing concurrent IO
  – BOTH at 80% busy, but one is doing double the IO! (See the sketch below.)
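The product in the formula is just Little's law applied to the device: (r/s + w/s) multiplied by the time each IO takes gives the average number of IOs in flight. Here it is as a small Python sketch applied to the two examples above; the function names and the 80% threshold are my own:

    def ios_in_flight(r_per_s, w_per_s, svctm_ms):
        """Little's law: rate * time per IO = average IOs in flight."""
        return (r_per_s + w_per_s) * (svctm_ms / 1000.0)

    def looks_serialized(r_per_s, w_per_s, svctm_ms, pct_busy):
        # A near-saturated array pushing less than ~1 IO at a time is serialized.
        return pct_busy >= 80 and ios_in_flight(r_per_s, w_per_s, svctm_ms) < 1

    # The two examples above: both arrays 80% busy, 200 ms per IO.
    for ios in (9, 4):
        c = ios_in_flight(ios, 0, 200)
        verdict = "serialized" if looks_serialized(ios, 0, 200, 80) else "concurrent IO"
        print(f"{ios} IO/s * 0.200 s = {c:.1f} in flight -> {verdict}")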
Example iostat -xkd exhibiting serialization
This is real-world iostat output on a system exhibiting some kind of “false” serialization in the filesystem/kernel layer. I reproduced it using rsync, so don't think this tells you anything about YouTube scale.
(columns omitted for readability; Linux md)