SLES 11/12 OS Tuning & Optimization Guide Part 1
suse.com /communities/blog/sles-1112-os-tuning-optimisation-guide-part-1/
Austin Joseph, 11/6/2015
SLES 11/12: Memory, Disk/Storage IO Tuning and Optimization Part 1
This document is a basic SLES tuning guide for memory and disk I/O tuning and optimization. Many of the
parameters and settings discussed are generic to Linux and can be applied to other distributions as well. Refer to
IHV/ISV application tuning guides or documentation before you implement the tuning parameters.
Before you start tuning the server, make sure you create a backup of the current kernel settings
using “sysctl -A”:
sysctl -A > /root/sysctl.settings.backup
Note: Some of the tuning parameters are configured aggressively to improve performance. Hence the settings
should never be applied in production environments without proper testing in designated test
environments.
1. Disable Transparent Huge Pages (THP): On systems with large memory, frequent access to the
Translation Lookaside Buffer (TLB) may slow down the system significantly. Although THP can improve
performance for a large number of workloads, for workloads that rarely reference large amounts of memory it
might regress performance. To disable THP, boot the system with the kernel parameter:
transparent_hugepage=never
or disable it at runtime as sketched below.
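A minimal runtime sketch, assuming the standard sysfs interface for THP is available on your kernel; the change does not persist across reboots:
# disable THP for the running system
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# verify the setting; the active value is shown in brackets
cat /sys/kernel/mm/transparent_hugepage/enabled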
2. Huge Pages: If the server is a heavily used application server, e.g. a database, it would benefit significantly
from using Huge Pages. The default size of a Huge Page in SLES is 2 MB. Enabling Huge Pages can bring
significant improvements for memory-intensive applications/databases and HPC machines, but this configuration
should only be done if the applications support Huge Pages. If the applications do not support Huge Pages,
configuring them results in wasted memory, because the reserved pages cannot be used for anything else by the OS.
By default no Huge Pages are allocated. Verify that the server has not allocated any Huge Pages via:
cat /proc/sys/vm/nr_hugepages
To allocate, e.g., 128 Huge Pages of size 2 MB (256 MB in total), you can pass the parameter to the kernel
via grub:
hugepages=128
After a reboot, verify that 128 Huge Pages are allocated to the server via:
cat /proc/sys/vm/nr_hugepages
Another recommended method to configure Huge Pages in SLES is to install the oracleasm rpm. In the file
/etc/sysconfig/oracle, change the parameter NR_HUGE_PAGES=0 to NR_HUGE_PAGES=128 (for example, if
you want to add 128 Huge Pages) and restart the oracle service.
Refer to the ISV documentation for best practice procedures of allocating Huge Pages. Allocating too many
Huge Pages may result in performance regressions under load.
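A quick way to inspect the Huge Pages pool is /proc/meminfo; the sysctl call below is a sketch of an alternative, runtime-only way to request the same 128 pages used in the example above (it may fail to allocate all pages on a fragmented system):
# show the Huge Pages pool, free pages and page size
grep Huge /proc/meminfo
# request 128 Huge Pages at runtime
sysctl -w vm.nr_hugepages=128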
3. Swap Improvements: If swap space is used, you should also have a look at /proc/meminfo to correlate
the use of swap with the amount of inactive anonymous (anon) memory pages. If the amount of swap in use is
smaller than the amount of inactive anon memory pages in /proc/meminfo, this indicates good behaviour:
only memory that is not actively needed is being swapped out. If, however, the amount of swap in use is larger
than the amount of inactive anon pages, then active memory is being swapped. This degrades performance,
indicates too much I/O traffic that may slow down your system, and may be addressed by installing more RAM.
Swappiness: Reducing the swappiness value from the default 60 to 25 reduces the tendency of the OS to swap
and maximizes the use of RAM on your server:
echo 25 > /proc/sys/vm/swappiness
4. VFS caches: To reduce the rate at which VFS caches are reclaimed, it can help to reduce the
vfs_cache_pressure value from the default 100 to 50. This variable controls the tendency of the kernel to
reclaim memory used for VFS caches (dentries and inodes) versus page cache and swap:
echo 50 > /proc/sys/vm/vfs_cache_pressure
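The echo commands above only affect the running system. A minimal sketch for making the two example values from items 3 and 4 persistent across reboots via /etc/sysctl.conf (25 and 50 are simply the example values used above):
# append to /etc/sysctl.conf
vm.swappiness = 25
vm.vfs_cache_pressure = 50
The file can then be loaded without a reboot via:
sysctl -p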
5. KSM: Kernel 2.6.32 introduced Kernel Samepage Merging (KSM). KSM allows an application to register
with the kernel so as to have its memory pages merged with the pages of other processes that also register to
have their pages merged. For KVM the KSM mechanism allows guest virtual machines to share pages with each
other. In today's environments, where many similar guest operating systems run under XEN or KVM on the
same host machine, this can result in significant memory savings. By default KSM is turned off (the run control is set to 0).
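A minimal sketch for enabling and monitoring KSM at runtime, assuming the standard /sys/kernel/mm/ksm interface is available on your kernel:
# start the KSM kernel thread
echo 1 > /sys/kernel/mm/ksm/run
# number of pages currently being shared (merged)
cat /sys/kernel/mm/ksm/pages_sharing
# stop KSM again
echo 0 > /sys/kernel/mm/ksm/run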
6. Memory Overcommit: Every Linux process generally tries to claim more memory than it needs, the goal
being to make the process faster: if the process did not have this headroom, it would have to ask the
kernel for more memory each time, which slows the process down when such requests keep happening under
memory starvation. overcommit_memory is one parameter that can be tuned to influence this behaviour.
The default value in SLES is 0, which means the kernel uses a heuristic check to see whether it has
memory available before granting more memory; the other two possible values for overcommit_memory are 1 and 2.
Setting it to 1 makes the system behave as if it has all the memory it needs, without checking, and 2 means
the kernel declines a memory request if it does not have the memory available. Some applications perform better
if the system is tuned to behave as if it has all the memory the application process asks for, but this can also
lead to out-of-memory situations in which the kernel OOM killer gets invoked. Changing overcommit_memory
can be done via:
echo 1 > /proc/sys/vm/overcommit_memory
The overcommit_ratio parameter is used when overcommit_memory is set to 2 and controls how much
memory may be committed: the commit limit is the swap space plus overcommit_ratio percent of physical RAM.
With the default value of 50, a system with 8 GB RAM and 2 GB swap has a commit limit of
2 GB + 50% of 8 GB = 6 GB. Changing overcommit_ratio to 70 raises the limit to 2 GB + 70% of 8 GB = 7.6 GB,
so that more memory can be committed to processes:
echo 70 > /proc/sys/vm/overcommit_ratio
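A quick verification sketch: /proc/meminfo shows the computed limit and the memory currently committed (the limit is only enforced when overcommit_memory is set to 2):
# CommitLimit is the computed limit, Committed_AS the memory currently committed
grep -i commit /proc/meminfo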
7. drop_caches: On systems with huge amounts of RAM, when the server ends up utilising a large amount of
RAM and starts swapping, it is possible that your application is not actually using that RAM but Linux is
caching aggressively; even though the application needs memory, the kernel will not free some of these
caches and starts swapping instead.
To deal with such situations, kernel 2.6.16 and later releases provide a non-destructive mechanism for the
kernel to drop the page cache and the inode and dentry caches via the drop_caches parameter. This can
release a lot of memory that sits unused in caches but, short of a server reboot, is not freed up by the kernel
for some reason.
As this mechanism is non-destructive, dirty objects are not freed, so it is desirable to run the
“sync” command first before dropping the page cache, dentries and inodes.
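For example, using the standard drop_caches values (1 frees the page cache, 2 frees dentries and inodes, 3 frees both), after a sync:
sync
# free the page cache only
echo 1 > /proc/sys/vm/drop_caches
# free dentries and inodes only
echo 2 > /proc/sys/vm/drop_caches
# free page cache, dentries and inodes
echo 3 > /proc/sys/vm/drop_caches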
1. Dirty Ratio: If there are performance issues with write performance on systems with large memory
(128 GB+), change the memory percentage settings for dirty_ratio and dirty_background_ratio as
documented in TID# 7010287:
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
TID# 7010287 Low write performance on SLES 11 servers with large RAM.
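To observe how much dirty data is pending writeback while testing these values, /proc/meminfo can be checked; this is only an observation aid and is not part of the TID:
# Dirty is data waiting for writeback, Writeback is data actively being written out
grep -E 'Dirty|Writeback' /proc/meminfo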
2. IO Scheduler: The default I/O scheduler for SLES is CFQ. It gives good performance for a wide range of I/O
tasks, but some I/O tasks can perform much better with certain types of hardware or applications such as databases.
To improve I/O performance for such workloads, the noop or deadline scheduler may give better results.
CFQ: CFQ is a fairness-oriented scheduler and is used by default on SUSE Linux Enterprise. The algorithm assigns
each thread a time slice in which it is allowed to submit I/O to disk. This way each thread gets a fair share of
I/O throughput. It also allows assigning I/O priorities to tasks, which are taken into account during scheduling
decisions.
NOOP: The NOOP scheduler performs only minimal merging functions on your data. There is no sorting and,
therefore, this scheduler has minimal overhead. It was developed for non-disk-based block
devices, such as memory devices and SSDs. It also does well on storage media that have extensive caching. In
some cases it can be helpful for devices that do I/O scheduling themselves, such as intelligent storage, or for devices
that do not depend on mechanical movement, like SSDs; since the NOOP scheduler has less overhead it may
produce better performance for such workloads.
Deadline: The deadline scheduler works with five different I/O queues and, therefore, is very capable of
distinguishing between read requests and write requests. When using this scheduler, read requests
get a higher priority. Write requests do not have a deadline and, therefore, data to be written can remain in
cache for a longer period. This scheduler does well in environments in which good read performance as
well as good write performance is required, but with some extra priority for reads, and it does
particularly well in database environments. Only one scheduler can be active at a time for system-wide I/O;
check with your hardware/storage vendor on the ability of their storage system to manage the
I/O scheduling itself before you activate the noop scheduler.
a) To enable an I/O scheduler system-wide at boot time, add one of the following parameters to /boot/grub/menu.lst:
elevator=noop or
elevator=deadline
b) To enable a specific scheduler for a certain block device, you can echo a new value to:
/sys/block/<device name>/queue/scheduler
Enabling a specific scheduler per block device allows you to run optimized I/O workloads for specific block
devices on your server, depending on the kind of workload each one is running. E.g. if your database is located
on block device sdg, you can enable the deadline scheduler for sdg while the rest of the OS continues on the
default CFQ or NOOP I/O scheduler, as in the sketch below.
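A minimal sketch of switching the scheduler for a single device at runtime; sdg is just the example device from above, and the change does not persist across reboots:
# show the available schedulers; the active one is shown in brackets
cat /sys/block/sdg/queue/scheduler
# switch this device to the deadline scheduler
echo deadline > /sys/block/sdg/queue/scheduler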
3. Improving I/O Reads: Though the deadline scheduler balances reads and writes with a slight bias towards
reads, the OS can be further optimized for read requests for certain types of applications on a per-disk basis
using the read_ahead_kb and nr_requests parameters. The kernel can detect when an application is
reading data sequentially from a file or from disk. In such a scenario it performs an intelligent read-ahead
algorithm whereby more data than requested is read from disk, so that when the application issues the next
read the kernel does not have to fetch the data: it is already available in the page cache, which improves read
performance. On a default Linux installation the read_ahead_kb value is set to 128 or 512. Read performance
can be improved by raising it to 1024 or 2048 on servers with fast disks. For device mapper devices the value
can be set as high as 8192, because device mapper has multiple underlying devices; 1024 is a good starting
point for tuning, though.
Similarly, the nr_requests default value is set to 128. Every request queue has a limit on the total number of
request descriptors that can be allocated for each of read and write I/O, which means that with the default
value only 128 read and 128 write requests can be queued at a time before the submitting process is put to
sleep. To get better read performance you can set nr_requests to 1024, but increasing the value too high might
introduce latency and degrade write performance. For latency-sensitive applications the converse is also true,
and nr_requests should be set lower than the default 128, in some cases as low as 1, so that writeback I/O
cannot allocate all of the available request descriptors and fill up the device queue with write I/O. To change
the values of read_ahead_kb and nr_requests, try the per-device sketch below.
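A sketch of setting both values per device; sdg is again just the example device, and 1024 the example value discussed above:
# read-ahead size in KB used for sequential reads
echo 1024 > /sys/block/sdg/queue/read_ahead_kb
# number of request descriptors per read and per write queue
echo 1024 > /sys/block/sdg/queue/nr_requests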
4. File System Optimization: All modern filesystems on Linux use some kind of journaling mechanism to strike
the right balance between data safety and performance. The default journal mode on SLES is data=ordered,
which ensures data is written to disk before the corresponding metadata is committed to the journal. To improve
filesystem performance at the cost of some data safety, data=writeback can be used. This option ensures internal
filesystem integrity but does not guarantee that new file data is on disk after a crash. Another option is to disable
barrier support; in case of a power failure or crash the filesystem may then have to run a filesystem check to
repair its structure (even with barriers enabled, most filesystems would still run an fsck after a power failure or
crash). Mounting the filesystem with data=writeback,barrier=0 improves filesystem performance at the cost of
some reliability; generally data=writeback alone should be enough to gain some performance without disabling
barriers. For example, by editing /etc/fstab as sketched below.
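A hypothetical /etc/fstab entry illustrating these mount options; the device, mount point and filesystem type are placeholders for your own setup:
# example line in /etc/fstab
/dev/sdg1   /data   ext3   defaults,data=writeback,barrier=0   0 2
Remount the filesystem (or reboot) afterwards for the new options to take effect.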
5. Kdump: Configure Kdump on your server and make sure it is tested to work, e.g. by triggering a test crash
from the keyboard as documented in TID 3374462. Although Kdump is not directly related to optimization of the
server, it is useful in server crash or hang situations where a kernel memory dump is needed to analyse the root
cause of the crash/hang of the OS. For servers with large amounts of RAM (512 GB – 1 TB and above) and with
sufficient disk space available to collect the memory dump, a reasonable value for KDUMP_DUMPLEVEL in the
/etc/sysconfig/kdump file is 15, which balances the need to capture maximum data from the dump against
keeping the dump size reasonable when used with the default compression enabled in kdump.
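A sketch of the corresponding line in /etc/sysconfig/kdump; 15 is simply the dump level suggested above:
# in /etc/sysconfig/kdump
KDUMP_DUMPLEVEL="15"
Restart the kdump service afterwards for the change to take effect.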
TID# 3374462 Configure kernel core dump capture