CMG 2020 General Performance Recommendations
cmgl.ca #FutureOfSimulation
20.CMG.07
BIOS/UEFI
Adjust for Best Performance
Most manufacturers have a “Best Performance” default profile that can be set in the BIOS.
Start by setting the BIOS to the “Best Performance” default before moving to any other settings.
On Dell† systems, there is a Memory Optimization setting called “MemOpMode”; set it to
OptimizerMode.
If you require additional BIOS settings information, please contact the manufacturer directly.
When NUMA is enabled, the system tracks which memory is “local” to a CPU and will take
advantage of faster access times inherent with local memory (hence “non-uniform access”).
Otherwise, all memory is treated as non-local and no performance gains are realized.
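On Linux, one way to confirm that the operating system actually sees more than one NUMA node is to count the nodeN entries under /sys/devices/system/node. The Python sketch below is illustrative only: the helper name count_numa_nodes is an assumption made for this example, and it is not a CMG tool.

```python
import os
import re

# Linux exposes one "nodeN" directory per NUMA node under this sysfs path.
NODE_DIR = "/sys/devices/system/node"


def count_numa_nodes(entries):
    """Count names of the form 'node<N>' in a directory listing."""
    return sum(1 for name in entries if re.fullmatch(r"node\d+", name))


if __name__ == "__main__":
    if os.path.isdir(NODE_DIR):
        nodes = count_numa_nodes(os.listdir(NODE_DIR))
        print(f"NUMA nodes visible to the OS: {nodes}")
    else:
        print("No NUMA information found at", NODE_DIR)
```

A multi-socket machine that reports a single node may have NUMA disabled in the BIOS; check the vendor documentation before changing the setting.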
Snoop Mode
On multi-socket machines, the Snoop Mode setting determines how processors monitor
changes made to memory within other processors’ cache lines. This can be important for CMG
software performance. CMG recommends using “Opportunistic Snoop” mode where available.
Note: There are several other options that can potentially impact performance, such as C-State
and Monitor/MWait. Implementations can vary, so please review the vendor’s
recommendations for best practices and other related settings to optimize CPU performance.
Operating System
In general, anything that improves the overall performance of the Operating System (e.g.,
disabling unnecessary processes) will also improve CMG software performance. Performance
tuning, which increases the performance of one aspect of the OS or hardware at the expense of
another, can help but should be carefully tracked and tested. Please note, CMG cannot make
specific recommendations, as there are too many variables (different hardware, OS versions,
network environments, etc.). There are many available resources, such as the white paper
“Performance Tuning Guidelines for Windows† Server” or the “RHEL† Performance Tuning
Guide” for those particular Operating Systems.
Environment Variables
Beyond changes to your BIOS, there are two environment variables that affect how the CMG
simulators interact with your hardware: KMP_AFFINITY and OMP_SCHEDULE.
KMP_AFFINITY (only supported on Intel) tells the Operating System how to bind threads to
particular cores to take advantage of cached information in “local” memory for that core. As a
general guideline, set KMP_AFFINITY as follows:
KMP_AFFINITY has to be manually adjusted per job if multiple jobs are running on the same
machine outside of scheduling software such as HPC or LSF. CMG does not recommend this
scenario.
OMP_SCHEDULE affects the execution of loops within the program based on number of
threads available. For IMEX and STARS set OMP_SCHEDULE=static,1. This setting is not
necessary for GEM, as it is managed internally.
The functionality of these settings can be hardware dependent; consequently, the optimal
setting for a user’s specific setup may differ from the above. These are only general
guidelines and should be tested in any new environment.
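As a minimal sketch of how these variables could be applied per job, the Python helper below builds an environment for a single simulator run. The function name build_simulator_env is hypothetical, and the compact,0 value in the example is only one of the values discussed later in this document; it should be replaced with whatever has been tested on the target machine.

```python
import os

# Simulators for which the text recommends OMP_SCHEDULE=static,1.
STATIC_SCHEDULE_SIMULATORS = {"IMEX", "STARS"}


def build_simulator_env(simulator, kmp_affinity=None, base_env=None):
    """Return a copy of the environment with the variables set for one job.

    simulator:    "IMEX", "STARS" or "GEM".
    kmp_affinity: value chosen and tested for this machine, or None to
                  leave KMP_AFFINITY untouched.
    base_env:     environment to copy; defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    if simulator.upper() in STATIC_SCHEDULE_SIMULATORS:
        env["OMP_SCHEDULE"] = "static,1"
    # GEM manages loop scheduling internally, so OMP_SCHEDULE is not set.
    if kmp_affinity is not None:
        env["KMP_AFFINITY"] = kmp_affinity
    return env


if __name__ == "__main__":
    env = build_simulator_env("IMEX", kmp_affinity="compact,0")
    print(env["OMP_SCHEDULE"], env["KMP_AFFINITY"])
```

The returned dictionary can be passed as the env argument of subprocess.run together with the site-specific simulator command line, which is not shown here.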
Hardware
Note: these are general guidelines only. Please contact CMG Sales for a detailed
description of the hardware recommended/used internally at CMG.
CPU
CMG software is very CPU-intensive and we always recommend CPUs with the best available
combination of clock speed and cache. The optimal number of cores is dependent on the
types of jobs, number of users, licensing, and other factors; talk to CMG Sales to determine the
optimal number for the specific situation.
Memory
Memory speed is an important factor in CMG performance and should be the fastest speed
available from the hardware vendor for a specific machine. During internal testing, CMG has seen
significant slowdowns with “low-voltage” RAM, which is occasionally found in lightweight laptops.
In addition, it is important that memory is properly balanced to take advantage of this speed. For
example, on Cascade Lake processors, 6 DIMMs/CPU is the recommended configuration for
achieving the best performance. With fewer than 6 DIMMs/CPU, the
speed drops. There are a number of downloadable tools, such as CPU-Z, to confirm the
effective memory speed.
The page file should be set to a fixed size rather than automatically managed by the Operating
System; this eliminates the overhead of managing the page file and prevents fragmentation.
Disk
Disk I/O is generally not a bottleneck for CMG software. However, if disk I/O is possibly
affecting performance, this can be easily tested by monitoring performance on the server using
a number of tools. Perfmon (Performance Monitor) in Windows† or iotop in Linux† are two
examples, but any standard monitoring tool should work.
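Where a dedicated monitoring tool is not at hand on Linux, a rough check is possible from two snapshots of /proc/diskstats, in which the third field of each line is the device name and the thirteenth is the cumulative time spent doing I/O in milliseconds. The Python sketch below is an assumption-laden illustration, not a replacement for Perfmon or iotop; the function names are invented for this example.

```python
import time


def read_busy_ms(diskstats_text, device):
    """Return the cumulative time spent doing I/O (ms) for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields[2] is the device name, fields[12] the busy time in ms.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise ValueError(f"device {device!r} not found")


def utilization(busy_before_ms, busy_after_ms, interval_s):
    """Fraction of the interval during which the device was busy."""
    return (busy_after_ms - busy_before_ms) / (interval_s * 1000.0)


def sample_utilization(device, interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s seconds apart and compare them."""
    with open(path) as f:
        before = read_busy_ms(f.read(), device)
    time.sleep(interval_s)
    with open(path) as f:
        after = read_busy_ms(f.read(), device)
    return utilization(before, after, interval_s)
```

Calling sample_utilization with the name of the disk that holds the simulation files while a job is running gives a value between 0 and 1; a value that stays close to 1 suggests the disk is worth investigating with a proper tool.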
Additional Notes on KMP_AFFINITY
When CMG simulator jobs are run outside of a job scheduler and the KMP_AFFINITY
environment variable is set, it is possible that concurrent jobs may compete for the same
cores. For example, if KMP_AFFINITY is set to compact,0 and two concurrent jobs are run
without a job scheduler, both jobs will start on core 0 on the first socket. Job schedulers usually
know which cores are available and will start the first job on the first core and the second job
on the first core that is not assigned to the first job to avoid over-subscription of resources.
For the same case, using KMP_AFFINITY=compact,1, the first thread of the first job will run on
core 0 of the first socket, the second thread of the first job will run on core 0 of the second socket
and so on. This causes memory traffic between the sockets and is not recommended.
However, this may be the best setting for single jobs, on a system that allows overclocking
(turbo mode) to optimize memory usage.
This variable setting is becoming more important with the newer multi-core systems now
available and needs further investigation. The proper setting for this variable depends on the
type of simulator, the number of cores required for the job, and the number of jobs being run
simultaneously per system.
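The contention described in these notes can be illustrated with a toy model. The Python sketch below reproduces only the placements as this document describes them (compact,0 fills one socket first; compact,1 alternates between sockets); it does not reproduce the actual algorithm of the OpenMP runtime, it ignores hyper-threads, and the function names are hypothetical.

```python
def place_threads(n_threads, sockets, cores_per_socket, mode):
    """Toy model of thread placement, returned as (socket, core) pairs.

    "compact,0": fill socket 0 core by core, then socket 1, and so on.
    "compact,1": alternate sockets, so consecutive threads land on
                 core 0 of socket 0, core 0 of socket 1, and so on.
    """
    if n_threads > sockets * cores_per_socket:
        raise ValueError("more threads than cores")
    if mode == "compact,0":
        # divmod gives (socket index, core index within that socket).
        return [divmod(t, cores_per_socket) for t in range(n_threads)]
    if mode == "compact,1":
        return [(t % sockets, t // sockets) for t in range(n_threads)]
    raise ValueError(f"unsupported mode: {mode}")


def contended_cores(job_a, job_b):
    """Cores claimed by both jobs when each one starts from core 0."""
    return sorted(set(job_a) & set(job_b))


if __name__ == "__main__":
    # Two 4-thread jobs started without a scheduler on 2 sockets x 8 cores.
    job1 = place_threads(4, sockets=2, cores_per_socket=8, mode="compact,0")
    job2 = place_threads(4, sockets=2, cores_per_socket=8, mode="compact,0")
    print("contended:", contended_cores(job1, job2))
    print("compact,1:", place_threads(4, 2, 8, "compact,1"))
```

In this model both unscheduled jobs claim the same four cores on socket 0 while the rest of the machine sits idle, which is the over-subscription a job scheduler is expected to avoid.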
Hyper-Threading Recommendation
For optimal performance, releases prior to 2018.10 recommended that hyper-threading be disabled. This
is no longer required, with the following considerations:
1. For optimal performance, a thread affinity setting should be set. (For example,
KMP_AFFINITY=compact,0 or KMP_AFFINITY=compact,1). Our testing has shown that not
employing this setting during simulation runs – via environment variable, or via job scheduler
configuration – can lead to material performance degradation vs. what is possible.
2. Extensive testing, using Intel processors, has shown that better performance is achieved when
hyper-threads are not used. This is now the default simulator behavior which means that the
number of threads (requested for the simulation) cannot exceed the number of physical cores,
unless the command line option ‘-htuse’ is used. GEM also allows use of keyword *HTUSE *ON
in the data file.
3. Furthermore, when the number of physical + hyper-thread cores in a machine is greater than 64,
the use of the Linux_x64 executable is strongly recommended; the affinity setting is not effective (at
more than 64 cores) on numerous Windows OS variants.
4. The *HTUSE keyword has no effect on machines where hyper-threading is off. It also has no effect on
machines with operating systems that do not respond to thread-binding by setting
KMP_AFFINITY.
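Considerations 2 and 3 above lend themselves to a simple pre-submission check. The Python sketch below is a hedged illustration: the function and parameter names are hypothetical, it is not part of any CMG product, and it assumes that with ‘-htuse’ the thread limit becomes the number of logical (physical + hyper-thread) cores, which the text above does not state explicitly.

```python
def check_thread_request(requested, physical_cores, logical_cores,
                         htuse=False, platform="linux"):
    """Return a list of warnings for a requested thread count."""
    warnings = []
    # Assumption: with -htuse the limit is the logical core count.
    limit = logical_cores if htuse else physical_cores
    if requested > limit:
        warnings.append(
            f"{requested} threads exceed the limit of {limit}"
            + ("" if htuse else " physical cores; -htuse would be needed")
        )
    if platform == "windows" and logical_cores > 64:
        warnings.append(
            "more than 64 logical cores: the Linux_x64 executable is "
            "strongly recommended because affinity may not be effective"
        )
    return warnings
```

For example, check_thread_request(24, physical_cores=16, logical_cores=32) returns one warning, while the same request with htuse=True returns none.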
Hyper-threading effects could depend on processor type, hardware configuration and number of
jobs scheduled per node. Please consult your IT department and/or CMG support for further
guidance.