Troubleshoot DIMM Memory Issues in UCS
Introduction
Prerequisites
Requirements
Components Used
Troubleshoot Methodology
Terms and Acronyms
Memory Placement
Memory Errors
Correctable vs. Uncorrectable Errors
Troubleshooting DIMMs via UCSM and CLI
To Check Errors from GUI
To Check Errors from CLI
Log Files to Check in Tech Support
DIMM Blacklisting
Methods to Clear DIMM Blacklisting Errors
UCSM GUI
UCSM CLI
Related Information
Notable Bugs
Introduction
This document describes how to troubleshoot memory module issues in the Cisco Unified
Computing System (UCS) solution. UCS uses Dual In-line Memory Modules (DIMMs) as RAM
modules.
Prerequisites
Requirements
Cisco recommends that you have knowledge of Cisco Unified Computing System (Cisco UCS).
Components Used
● Memory placement
● Troubleshoot DIMMs via UCSM and CLI
● Logs to check in tech support
Memory Placement
Memory placement is one of the most notable physical aspects of the UCS solution.
Typically, the server ships with the requested amount of memory pre-populated. When in
doubt, refer to the hardware installation guide, which should be updated regularly as new
hardware is introduced.
For memory population rules, refer to the B-Series technical specifications for the specific
platform.
http://www.cisco.com/c/en/us/products/servers-unified-computing/ucs-b-series-blade-servers/datasheet-listing.html
Memory Errors
● DIMM Error
● ECC (Error Correcting Code) Error
Multibit = Uncorrectable
DIMMs with correctable errors are not disabled and remain available for the OS to use. Total
Memory and Effective Memory are the same (taking memory mirroring into account). UCSM
reports these correctable errors with an operability state of Degraded, while overall
operability remains Operable.
Uncorrectable errors generally cannot be fixed and may make it impossible for the application or
operating system to continue execution. A DIMM with an uncorrectable error is disabled, and the
OS does not see that memory. The UCSM operState changes to "Inoperable" in this case.
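The distinction above can be sketched as a small Python lookup. This is an illustration only: the `DimmReport` type, the `classify` function, and the "any count above zero" rule are assumptions made for the example, not UCSM internals, and the real thresholds UCSM applies may differ.

```python
from dataclasses import dataclass


@dataclass
class DimmReport:
    """Hypothetical per-DIMM error counters (not a UCSM data structure)."""
    correctable_errors: int
    uncorrectable_errors: int


def classify(report):
    """Map error counters to the states described in the text.

    Assumption: any nonzero count triggers the state change.
    """
    if report.uncorrectable_errors > 0:
        # Uncorrectable: DIMM is disabled and hidden from the OS.
        return {"operability": "Inoperable", "visible_to_os": False}
    if report.correctable_errors > 0:
        # Correctable: DIMM stays in use but is flagged as Degraded.
        return {"operability": "Degraded", "visible_to_os": True}
    return {"operability": "Operable", "visible_to_os": True}


if __name__ == "__main__":
    print(classify(DimmReport(correctable_errors=12, uncorrectable_errors=0)))
    print(classify(DimmReport(correctable_errors=0, uncorrectable_errors=1)))
```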
From there, you can obtain per-DIMM statistics or reset the error counters.
Resetting the BMC does not impact the OS running on the blade.
Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to
increased memory error rates. Traditionally, the industry has treated correctable errors in the
same way as uncorrectable errors and required the module to be replaced immediately upon alert.
Given extensive research showing that correctable errors are not correlated with uncorrectable
errors and do not degrade system performance, the Cisco UCS team recommends against
immediate replacement of modules with correctable errors. Customers who experience a
Degraded memory alert for correctable errors should reset the memory error and resume
operation. Following this recommendation avoids unnecessary server disruption. Future
enhancements to error management are planned to help distinguish among various types of
correctable errors and identify the appropriate actions, if any.
Cisco recommends a minimum of version 2.1(3c) or 2.2(1b), which include enhancements to UCS
memory error management.
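A quick way to check a version string against these minimums is sketched below in Python. The function names and the interpretation of the rule are assumptions: the sketch treats 2.1(3c) as the floor for the 2.1 release train, 2.2(1b) as the floor for the 2.2 train, and assumes any later train already includes the enhancement. It also assumes the `major.minor(maintenance+letters)` version format shown in this document.

```python
import re

# Matches strings such as "2.1(3c)" or "2.2(1)"; the patch letters are optional.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_version(text):
    """Turn '2.1(3c)' into a comparable tuple (2, 1, 3, 'c')."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance), patch)


def meets_minimum(version):
    """Return True if the version is at or above the recommended minimum."""
    parsed = parse_version(version)
    train = parsed[:2]
    if train < (2, 1):
        return False
    if train == (2, 1):
        return parsed >= parse_version("2.1(3c)")
    if train == (2, 2):
        return parsed >= parse_version("2.2(1b)")
    return True  # assumption: later trains include the enhancement


if __name__ == "__main__":
    for candidate in ("2.1(3a)", "2.1(3c)", "2.2(1b)", "2.2(3)"):
        print(candidate, meets_minimum(candidate))
```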
If the troubleshooting above did not help, raise a support request for assistance.
These files provide information about memory as seen from BIOS level.
The information there can be cross-referenced against the DIMM state reporting tables shown above.
Example:
/var/nuova/BIOS/RankMarginTest.txt
MEMBIST
//// Failed
//// Ignored
02h-0Fh Reserved
12h-17h Reserved
19h-1Fh Reserved
DIMM Blacklisting
In Cisco UCS Manager, the state of the Dual In-line Memory Module (DIMM) is based on SEL
event records. When the BIOS encounters a noncorrectable memory error during memory test
execution, the DIMM is marked as faulty. A faulty DIMM is considered a nonfunctional device.
If you enable DIMM blacklisting, Cisco UCS Manager monitors the memory test execution
messages and blacklists any DIMMs that encounter memory errors in the DIMM SPD data. This
allows the host to map out any DIMMs that encounter uncorrectable ECC errors.
Server firmware must be 2.2(1)+ for B-series blades and 2.2(3)+ for C-series rack servers to
properly implement this feature.
DIMM Blacklisted = 1E
DIMM Status:
00 - Not Installed
01 - Installed
Example
DIMM Status:
|=======================|
| Channel | 1 2 3 |
|=======================|
| A | 25 1F 25 |
| B | 01 01 01 |
| C | 1F 25 25 |
| D | 01 01 01 |
| E | 01 01 01 |
| F | 25 25 1E |
| G | 01 01 01 |
| H | 01 01 01 |
|=======================|
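A table in this layout can be scanned for blacklisted slots with a short Python script. The sketch below is a hedged illustration: the function name, the slot naming (`A1`, `F3`, and so on), and the parsing rules are assumptions based only on the example above. The legend contains only the three codes listed in this document, and any other code is reported as not listed rather than guessed.

```python
# Legend limited to the codes given in this document.
STATUS_CODES = {"00": "Not Installed", "01": "Installed", "1E": "Blacklisted"}

SAMPLE_TABLE = """\
|=======================|
| Channel | 1 2 3 |
|=======================|
| A | 25 1F 25 |
| B | 01 01 01 |
| C | 1F 25 25 |
| D | 01 01 01 |
| E | 01 01 01 |
| F | 25 25 1E |
| G | 01 01 01 |
| H | 01 01 01 |
|=======================|
"""


def parse_dimm_status(table_text):
    """Return a dict such as {'A1': '25', ...} from the text table."""
    slots = {}
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # border lines have no inner separator
        channel, codes = cells
        if len(channel) != 1 or not channel.isalpha():
            continue  # skips the "Channel" header row
        for position, code in enumerate(codes.split(), start=1):
            slots[f"{channel}{position}"] = code.upper()
    return slots


def describe(code):
    """Look up a status code, without guessing at unlisted values."""
    return STATUS_CODES.get(code, "not listed in this legend")


slots = parse_dimm_status(SAMPLE_TABLE)
blacklisted = [slot for slot, code in slots.items() if code == "1E"]

if __name__ == "__main__":
    print("Blacklisted slots:", blacklisted)
    print("A1:", slots["A1"], "-", describe(slots["A1"]))
```

In the example table, the only slot carrying the blacklisted code is position 3 on channel F.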
UCS-B/chassis/server # reset-all-memory-errors
Related Information
● http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/ts/guide_old_FM/TS_Server.html
● http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-2/b_UCSM_GUI_Configuration_Guide_2_2/configuring_server_related_policies.html#concept_2069B1145AAB47638CF9AFBB12198CEF
● https://www.cisco.com/c/dam/en/us/support/docs/servers-unified-computing/ucs-b-series-blade-servers/CiscoUCSEnhancedMemoryErrorManagementTechNoteFeb42015.pdf
● http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/ts/guide_old_FM/TS_Server.html#wp1073848
● http://www.cisco.com/c/en/us/support/docs/field-notices/636/fn63651.html
Notable Bugs
Cisco Bug ID CSCug93076 B200M3-DDR voltage regulator may have excessive noise under light load
Cisco Bug ID CSCup07488 IPMI DIMM fault sensor is setting Dimm Degraded with no error count.
Cisco Bug ID CSCuw44524 C460M4, B260M4 or B460M4 IVB clear CMOS can cause memory UECC Error