ZFS Evil Tuning Guide
Overview
Tuning is Evil
Tuning is often evil and should rarely be done.
First, consider that the default values are set by the people who know the most about the effects of the tuning on the
software that they supply. If a better value exists, it should be the default. While alternative values might help a given
workload, they could quite possibly degrade some other aspects of performance, occasionally catastrophically so.
Over time, tuning recommendations might become stale at best or might lead to performance degradations.
Customers are leery of changing a tuning that is in place and the net effect is a worse product than what it could be.
Moreover, tuning enabled on a given system might spread to other systems, where it might not be warranted at all.
Nevertheless, it is understood that customers who carefully observe their own system may understand aspects of
their workloads that cannot be anticipated by the defaults. In such cases, the tuning information below may be
applied, provided that one works to carefully understand its effects.
If you must implement a ZFS tuning parameter, please reference the URL of this document:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
Review ZFS Best Practices Guide
On the other hand, ZFS best practices are things we encourage people to use. They are a set of recommendations
that have been shown to work in different environments and are expected to keep working in the foreseeable future.
So, before turning to tuning, make sure you’ve read and understood the best practices around deploying a ZFS
environment that are described here:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
Identify ZFS Tuning Changes
The syntax for enabling a given tuning recommendation has changed over the life of ZFS releases. So, when
upgrading to newer releases, make sure that the tuning recommendations are still effective. If you decide to use a
tuning recommendation, reference this page in the /etc/system file or in the associated script.
The Tunables
In no particular order:
Tuning ZFS Checksums
Limiting the ARC Cache
File-Level Prefetching
Device-Level Prefetching
Device I/O Queue Size (I/O Concurrency)
Cache Flushes
Disabling the ZIL (Don't)
Disabling Metadata Compression
Tuning ZFS Checksums
End-to-end checksumming is one of the great features of ZFS. It allows ZFS to detect and correct many kinds of
errors other products can’t detect and correct. Disabling checksum is, of course, a very bad idea. Having file system
level checksums enabled can alleviate the need to have application level checksums enabled. In this case, using the
ZFS checksum becomes a performance enabler.
The checksums are computed asynchronously to most application processing and should normally not be an issue.
However, each pool currently has a single thread computing the checksums (RFE below) and it is possible for that
computation to limit pool throughput. So, if the disk count is very large (>> 10) or a single CPU is weak (< 1 GHz), then this
tuning might help. If a system is close to CPU saturation, the checksum computations might become noticeable. In
those cases, do a run with checksums off to verify whether checksum calculation is a problem.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_ZFS_Checksums
Checksum use can be disabled and re-enabled dynamically on a per-file-system basis:
zfs set checksum=off <pool/fs>
And reverted:
zfs set checksum=on <pool/fs>
Verify the type of checksum used:
zfs get checksum <pool/fs>
The fletcher2 checksum (the default) has been observed to consume roughly 1 GHz worth of CPU when checksumming 500
MBytes per second.
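Before touching production data, the effect can be measured on a scratch file system; this is only a sketch, and the pool and file system names (tank/cktest) are illustrative:
# zfs create tank/cktest
# zfs set checksum=off tank/cktest
(run the workload of interest against /tank/cktest and note CPU usage and throughput)
# zfs set checksum=on tank/cktest
(repeat the run with checksums on and compare)
# zfs get checksum tank/cktest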
RFEs
6533726 single-threaded checksum & raidz2 parity calculations limit write bandwidth on thumper (Fixed in
Nevada, build 79 and Solaris 10 10/08)
Limiting the ARC Cache
The ZFS ARC (Adaptive Replacement Cache) grows to use most of the otherwise free memory and shrinks again when
applications ask for that memory. Consider limiting the ARC in the following circumstances:
If a future memory requirement is significantly large and well defined, then it can be advantageous to
prevent ZFS from growing the ARC into it. For example, if we know that a future application requires 20%
of memory, it makes sense to cap the ARC such that it does not consume more than the remaining 80% of
memory.
If the application is a known consumer of large memory pages, then again limiting the ARC prevents ZFS
from breaking up the pages and fragmenting the memory. Limiting the ARC preserves the availability of
large pages.
If dynamic reconfiguration of a memory board is needed (supported on certain platforms), then it is a
requirement to prevent the ARC (and thus the kernel cage) from growing onto all boards.
If an application's demand for memory fluctuates, the ZFS ARC caches data during periods of weak demand
and then shrinks during periods of strong demand. However, on large memory systems, ZFS does not shrink
below the value of arc_c_min, which is currently approximately 12% of memory. If an application's peak
memory usage requires more than 88% of system memory, tuning arc_c_min is currently required,
until a better default is selected as part of 6855793.
For these cases, you might consider limiting the ARC. Limiting the ARC will, of course, also limit the amount of
cached data and this can have adverse effects on performance. No easy way exists to foretell whether limiting the ARC
degrades performance.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
For example, if an application needs 5 GBytes of memory on a system with 36 GBytes of memory, you could set the
ARC maximum to 30 GBytes (0x780000000 or 32212254720 bytes). Set the zfs:zfs_arc_max parameter in the
/etc/system file:
set zfs:zfs_arc_max = 0x780000000
On a running system, the ARC targets can also be adjusted with mdb. When lowering the maximum, keep the internal
variables consistent with the following formula:
arc.c = arc.c_max
arc.p = arc.c / 2
For example, to set the ARC parameters to small values, such as arc.c_max to 512 MBytes, and complying with the
formula above (arc.c to 512 MBytes, and arc.p to 256 MBytes), print the addresses of the fields and overwrite them.
The addresses below are from one particular system; use the ones printed on yours, and write c_max at its printed
address in the same way:
# mdb -kw
> arc::print -a p c
ffffffffc00b3260 p = 0xb75e46ff
ffffffffc00b3268 c = 0x11f51f570
> ffffffffc00b3268/Z 0x20000000
> ffffffffc00b3260/Z 0x10000000
You can also use the arcstat script available at http://blogs.sun.com/realneel/entry/zfs_arc_statistics to check the ARC
size as well as other ARC statistics.
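If the arcstat script is not at hand, the same information is exposed through the ZFS kstats; a quick check of the current ARC size and targets (kstat names as found on current Solaris releases):
# kstat -p zfs:0:arcstats:size
# kstat -p zfs:0:arcstats:c
# kstat -p zfs:0:arcstats:c_max
# kstat -p zfs:0:arcstats:p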
Here is a perl script that you can call from an init script to configure your ARC on boot with the above guidelines:
#!/bin/perl
# arc_tune: set the live ARC targets through mdb -kw; the argument is the new maximum in decimal bytes.
use strict;
use IPC::Open2;
my $arc_max = shift @ARGV;
if ( !defined($arc_max) ) {
        print STDERR "usage: arc_tune <arc max in bytes>\n";
        exit -1;
}
$| = 1;
my %syms;
my $mdb = "/usr/bin/mdb";
open2(*READ, *WRITE, "$mdb -kw") or die "cannot run $mdb";
# Ask mdb for the addresses of the ARC fields; output lines look like "<address> <name> = <value>".
print WRITE "arc::print -a\n";
while (<READ>) {
        $syms{$2} = $1 if /^\s*([0-9a-f]+) (\w+) =/;
        last if /^\}/;
}
# Apply the guidelines above: c and c_max to the new maximum, p to half of it.
printf WRITE "%s/Z 0x%x\n", $syms{"c_max"}, $arc_max;
printf WRITE "%s/Z 0x%x\n", $syms{"c"}, $arc_max;
printf WRITE "%s/Z 0x%x\n", $syms{"p"}, int($arc_max / 2);
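Assuming the script is saved as, say, /usr/local/bin/arc_tune (the name and path are only illustrative), an init script could cap the ARC at 512 MBytes at boot with:
/usr/local/bin/arc_tune 536870912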
RFEs
6488341 ZFS should avoiding growing the ARC into trouble (Fixed in Nevada, build 107)
6522017 The ARC allocates memory inside the kernel cage, preventing DR
6424665 ZFS/ARC should cleanup more after itself
6429205 Each zpool needs to monitor its throughput and throttle heavy writers (Fixed in Nevada, build 87
and Solaris 10 10/08) For more information, see this link: New ZFS write throttle
6855793 ZFS minimum ARC size might be too large
Further Reading
http://blogs.sun.com/roch/entry/does_zfs_really_use_more
http://blogs.sun.com/realneel/entry/zfs_arc_statistics
File-Level Prefetching
ZFS implements a file-level prefetching mechanism labeled zfetch. This mechanism looks at the patterns of reads to
files and anticipates some reads, reducing application wait times. The current code needs attention (RFE below)
and suffers from two drawbacks:
Sequential read patterns made of small reads very often hit in the cache. In this case, the current code
consumes a significant amount of CPU time trying to find the next I/O to issue, whereas performance is
governed more by the CPU availability.
The zfetch code has been observed to limit scalability of some loads.
So, if CPU profiling, by using lockstat(1M) with the -I argument or er_kernel as described here:
http://developers.sun.com/prodtech/cc/articles/perftools.html
shows significant time in zfetch_* functions, or if lock profiling (lockstat(1M)) shows contention around zfetch locks,
then disabling file-level prefetching should be considered.
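One commonly used way to gather such a kernel profile is lockstat's profiling mode; this sketch samples the kernel for 30 seconds and reports the top 20 entries:
# lockstat -kIW -D 20 sleep 30
Look for zfetch_* functions near the top of the output.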
Disabling prefetching can be achieved dynamically or through a setting in the /etc/system file.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching
Set dynamically:
echo zfs_prefetch_disable/W0t1 | mdb -kw
Revert to default:
echo zfs_prefetch_disable/W0t0 | mdb -kw
Or set the following parameter in the /etc/system file:
set zfs:zfs_prefetch_disable = 1
On older releases that predate the zfs_prefetch_disable tunable, the zfetch_array_rd_sz parameter has been used for the
same purpose:
set zfs:zfetch_array_rd_sz = 0
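To confirm the live value of zfs_prefetch_disable on a running system (mdb prints the variable as a decimal integer):
# echo zfs_prefetch_disable/D | mdb -k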
RFEs
6412053 zfetch needs some love
6579975 dnode_new_blkid should first check as RW_READER (Fixed in Nevada, build 97)
Device-Level Prefetching
ZFS does device-level read-ahead in addition to file-level prefetching. When ZFS reads a block from a disk, it inflates
the I/O size, hoping to pull interesting data or metadata from the disk. This data is stored in a 10-MByte LRU per-vdev
cache, which short-cuts the ZIO pipeline when a later read finds its data already present in the cache.
Prior to the Solaris Nevada build snv_70, this code caused problems for systems with lots of disks because the extra
prefetched data could cause congestion on the channel between the storage and the host. Tuning down the size by
which I/Os are inflated (the zfs_vdev_cache_bshift tunable below) had been effective for OLTP-type loads in the past.
However, with the fix for bug 6437054, the code now prefetches only metadata, and this is not expected to require any tuning.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching
No tuning is required for Solaris Nevada releases, build 70 and after.
Solaris 10 (up to Solaris 10 5/08) and Nevada (up to build 70) Releases
Setting this tunable might only be appropriate in the Solaris 10 8/07 and Solaris 10 5/08 releases and Nevada
releases from build 53 to build 69.
set zfs:zfs_vdev_cache_bshift = 13
This limits the inflated read size to 8 KBytes (2^13 bytes).
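Before deciding that any tuning is needed, the effectiveness of the per-vdev cache can be observed through its kstat, which reports system-wide delegations, hits, and misses counters:
# kstat -p zfs:0:vdev_cache_stats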
RFEs
6437054 vdev_cache wises up: increase DB performance by 16% (Fixed in Nevada, build 70 and Solaris 10
10/08)
Further Reading
http://blogs.sun.com/erickustarz/entry/vdev_cache_improvements_to_help
Device I/O Queue Size (I/O Concurrency)
ZFS limits the number of concurrent I/Os it issues to each vdev; the limit is controlled by the zfs_vdev_max_pending
tunable (the current default is 10). A deep per-device queue helps throughput, but it also increases the service time of
each individual I/O, which hurts loads that wait on synchronous writes. Reducing the queue depth can therefore be
considered for latency-sensitive loads.
The Solaris Nevada release now has the option of storing the ZIL on separate devices from the main pool. Using
separate intent log devices can alleviate the need to tune this parameter for loads that are synchronously write
intensive.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Conc
urrency.29
Tuning is not expected to be effective for NVRAM-based storage arrays.
Revert to default:
set zfs:zfs_vdev_max_pending = 10
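The value can also be inspected and changed on a live system with mdb; the queue depth of 5 below is only an example, not a recommendation:
# echo zfs_vdev_max_pending/D | mdb -k
# echo zfs_vdev_max_pending/W0t5 | mdb -kw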
RFEs
6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
Further Reading
http://blogs.sun.com/roch/entry/tuning_the_knobs
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
Cache Flushes
If you've noticed terrible NFS or database performance on a SAN storage array, the problem is not with ZFS, but with
the way the disk drivers interact with the storage devices.
ZFS is designed to work with storage devices that manage a disk-level cache. ZFS commonly asks the storage
device to ensure that data is safely placed on stable storage by requesting a cache flush. For JBOD storage, this
works as designed and without problems. For many NVRAM-based storage arrays, a problem might come up if the
array takes the cache flush request and actually does something rather than ignoring it. Some storage will flush their
caches despite the fact that the NVRAM protection makes those caches as good as stable storage.
ZFS issues infrequent flushes (every 5 seconds or so) after the uberblock updates. The problem here is fairly
inconsequential. No tuning is warranted here.
ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and
so on). The completion of this type of flush is waited upon by the application and impacts performance. Greatly so, in
fact. From a performance standpoint, this neutralizes the benefits of having an NVRAM-based storage.
Contact your storage vendor for instructions on how to tell the storage devices to ignore the cache flushes sent by
ZFS. For SANtricity-based storage devices, instructions are documented in CR 6578220.
If you are not able to configure the storage device in an appropriate way, the preferred mechanism is to tune sd.conf
specifically for your storage. See the instructions below.
As a last resort, when all LUNs exposed to ZFS come from NVRAM-protected storage array and procedures ensure
that no unprotected LUNs will be added in the future, ZFS can be tuned to not issue the flush requests by setting
zfs_nocacheflush. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss,
application level corruption, or even pool corruption. In some NVRAM-protected storage arrays, the cache flush
command is a no-op, so tuning in this situation makes no performance difference.
NOTE: Cache flushing is commonly done as part of the ZIL operations. While disabling cache flushing can, at times,
make sense, disabling the ZIL does not.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
NOTE: If you are carrying forward an /etc/system file, please verify that any changes made still apply to your current
release. Help us rid the world of /etc/system viruses.
A recent fix qualified the flush request semantics so that storage devices can ignore the requests when they have the
proper protection. This change required a fix to our disk drivers and support for the updated semantics in the storage.
If the storage device does not recognize this improvement, here are instructions to tell the Solaris OS not to send any
synchronize cache commands to the array. If you use these instructions, make sure all targeted LUNs are indeed
protected by NVRAM.
Caution: All cache sync commands are ignored by the device. Use at your own risk.
1. Use the format utility to run the inquiry subcommand on a LUN from the storage array to obtain its vendor ID (VID)
and product ID (PID). For example:
# format
.
.
.
format> inquiry
Vendor: ATA
Product: Super Duper
format>
2. ssd driver (SPARC FC): Add lines such as the following to the /kernel/drv/ssd.conf file:
ssd-config-list = "ATA     Super Duper     ", "nvcache1";
nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;
3. sd driver (x64 and a few SPARC FC drivers): Add similar lines to the /kernel/drv/sd.conf file:
sd-config-list = "ATA     Super Duper     ", "nvcache1";
nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;
Note: In the above examples, nvcache1 is just a token in sd.conf. You could use any similar token.
4. Add whitespace to make the vendor ID (VID) 8 characters long (here "ATA     ") and the product ID (PID) 16
characters long (here "Super Duper     ") in the sd-config-list entry, as illustrated above.
5. After the sd.conf or ssd.conf modification and reboot, you can tune zfs_nocacheflush back to its default
value (of 0) with no adverse effect on performance.
For more cache tuning resource information, see:
http://blogs.digitar.com/jjww/?itemid=44.
http://forums.hds.com/index.php?showtopic=497.
Current Solaris 10 Releases and Solaris Nevada Releases
Starting in the Solaris 10 5/08 release and Solaris Nevada build 72 release, the sd and ssd drivers should properly
handle the SYNC_NV bit, so no changes should be needed.
Earlier Solaris 10 Releases and Solaris Nevada Releases
On releases without the SYNC_NV handling, ZFS can be told not to issue the flush requests by setting the following
parameter in the /etc/system file:
set zfs:zfs_nocacheflush = 1
Revert to default:
set zfs:zfs_nocacheflush = 0
Risk: Some storage might revert to working like a JBOD disk when its battery is low, for instance. Disabling the
caches can have adverse effects here. Check with your storage vendor.
On still older releases that predate the zfs_nocacheflush tunable, the equivalent setting is:
set zfs:zil_noflush = 1
Set dynamically:
echo zil_noflush/W0t1 | mdb -kw
Revert to default:
echo zil_noflush/W0t0 | mdb -kw
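To double-check on a running system whether the zfs_nocacheflush tuning is active (the variable is 0 by default):
# echo zfs_nocacheflush/D | mdb -k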
RFEs
6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices (Fixed
in Nevada, build 74 and Solaris 10 5/08)
6460889 zil shouldn’t send write-cache-flush command to busted devices
Disabling the ZIL (Don’t)
ZIL stands for ZFS Intent Log. It is used during synchronous write operations. The ZIL is an essential part of ZFS
and should never be disabled. Significant performance gains can be achieved by not having the ZIL, but that would
be at the expense of data integrity. One can be infinitely fast, if correctness is not required.
One reason to disable the ZIL is to check if a given workload is significantly impacted by it. A little while ago, a
workload that was a heavy consumer of ZIL operations was shown not to be impacted by disabling the ZIL. This
convinced us to look elsewhere for improvements. If the ZIL is shown to be a factor in the performance of a workload,
more investigation is necessary to see if the ZIL can be improved.
The OpenSolaris 2008 releases, Solaris 10 10/08 release, and Solaris Nevada build 68 or later releases have the option
of storing the ZIL on separate log devices from the main pool. Using separate possibly low latency devices for the
Intent Log is a great way to improve ZIL sensitive loads. This feature is not currently supported on a root pool.
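A separate log device is added with the zpool add command; the pool and device names here are only placeholders:
# zpool add tank log c4t0d0
or, to protect the log itself, a mirrored pair:
# zpool add tank log mirror c4t0d0 c4t1d0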
In general, negative ZIL performance impacts are worse on storage devices that have high write latency. HDD write
latency is on the order of 10-20 ms. Many hardware RAID arrays have nonvolatile write caches where the write
latency can be on the order of 1-10 ms. SSDs have write latency on the order of 0.2 ms. As the write latency
decreases, the negative performance effects are diminished, which is why using an SSD as a separate ZIL log is a
good thing. For hardware RAID arrays with nonvolatile cache, the decision to use a separate log device is less clear.
YMMV.
The size of the separate log device may be quite small. A rule of thumb is that you should size the separate log to be
able to handle 10 seconds of your expected synchronous write workload. It would be rare to need more than 100
MBytes in a separate log device, but the separate log must be at least 64 MBytes.
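For example, a load that peaks at 5 MBytes per second of synchronous writes would need roughly 5 x 10 = 50 MBytes of separate log by this rule, so the 64-MByte minimum already covers it; the 5 MBytes per second figure is only an illustration.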
Caution: Disabling the ZIL on an NFS server can lead to client side corruption. The ZFS pool integrity itself is not
compromised by this tuning.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
Note: The ZIL is disabled by setting the zil_disable tunable to 1 (for example, set zfs:zil_disable = 1 in the /etc/system
file). The tunable is only evaluated during dataset mount, so while it can be changed dynamically, to reap the benefits
you must zfs umount and then zfs mount the file system (or reboot, or export and import the pool).
RFEs
6280630 zil synchronicity
Further Reading
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
http://blogs.sun.com/erickustarz/entry/zil_disable
http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
Disabling Metadata Compression
With ZFS, compression of data blocks is under the control of the file system administrator and can be turned on or
off by using the command "zfs set compression ...".
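For example, to enable and then verify data compression on a file system (the dataset name tank/home is only illustrative):
# zfs set compression=on tank/home
# zfs get compression tank/home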
On the other hand, ZFS internal metadata is always compressed on disk, by default. For metadata intensive loads,
this default is expected to gain some amount of space (a few percent) at the expense of a little extra CPU
computation. However, a bigger motivation exists to have metadata compression on. For directories that grow to
millions of objects then shrink to just a few, metadata compression saves large amounts of space (>>10X).
In general, metadata compression can be left as is. If your workload is CPU intensive (say > 80% load), kernel
profiling shows metadata compression is a significant contributor, and you do not expect to create and shrink huge
directories, then disabling metadata compression can be attempted with the goal of providing more CPU to handle
the workload.
If you tune this parameter, please reference this URL in shell script or in an /etc/system comment.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_Metadata_Compression
Set dynamically:
echo zfs_mdcomp_disable/W0t1 | mdb -kw
Revert to default:
echo zfs_mdcomp_disable/W0t0 | mdb -kw
Or set the following parameter in the /etc/system file:
set zfs:zfs_mdcomp_disable = 1
RFEs
6391873 metadata compression should be turned back on (Fixed in Nevada, build 36)
Additional ZFS References
ZFS Best Practices
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
ZFS Dynamics
http://blogs.sun.com/roch/entry/the_dynamics_of_zfs
ZFS Links
http://opensolaris.org/os/community/zfs/links/
Er_kernel profiling
http://developers.sun.com/prodtech/cc/articles/perftools.html
ZFS and Database/OLTP
http://blogs.sun.com/realneel/entry/zfs_and_databases
http://blogs.sun.com/realneel/entry/zfs_and_databases_time_for
http://blogs.sun.com/roch/entry/zfs_and_oltp
ZFS and NFS
http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
ZFS and Direct I/O
http://blogs.sun.com/roch/entry/zfs_and_directio
ZFS Separate Intent Log (SLOG)
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
Integrated RFEs that introduced or changed tunables
snv_51 : 6477900 want more /etc/system tunables for ZFS performance analysis
snv_52 : 6485204 more tuneable tweakin
snv_53 : 6472021 vdev knobs can not be tuned