Troubleshoot DIMM Memory Issues in UCS

The document describes troubleshooting memory module issues in Cisco UCS. It covers memory placement, checking errors in UCSM and CLI, relevant log files, and DIMM blacklisting. Correctable errors are now treated differently than uncorrectable errors.

Uploaded by

Santosh Lopez

Contents

Introduction
Prerequisites
Requirements
Components Used
Troubleshoot Methodology
Terms and Acronyms
Memory Placement
Memory Errors
Correctable vs. Uncorrectable Errors
Troubleshooting DIMM’s via UCSM and CLI
To Check Errors from GUI
To Check Errors from CLI
Log Files to Check in Tech Support
DIMM Blacklisting
Methods to Clear DIMM Blacklisting Errors
UCSM GUI
UCSM CLI
Related Information
Notable Bugs

Introduction
This document describes how to troubleshoot memory module issues in the Cisco Unified
Computing System (UCS) solution. UCS uses Dual In-line Memory Modules (DIMMs) as RAM
modules.

Prerequisites
Requirements

Cisco recommends that you have knowledge of Cisco Unified Computing System (Cisco UCS).

Components Used

This document is not restricted to specific software and hardware versions.

However, this document focuses on:

● Cisco UCS B-Series Blade Servers
● UCS Manager
The information in this document was created from the devices in a specific lab environment. All of
the devices used in this document started with a cleared (default) configuration. If your network is
live, make sure that you understand the potential impact of any command.
Troubleshoot Methodology
This section covers the main parts of troubleshooting UCS memory issues.

● Memory placement
● Troubleshoot DIMM’s via UCSM and CLI
● Logs to check in tech support

Terms and Acronyms


DIMM      Dual In-line Memory Module
ECC       Error Correcting Code
LVDIMM    Low Voltage DIMM
MCA       Machine Check Architecture
MEMBIST   Memory Built-in Self Test
MRC       Memory Reference Code
POST      Power On Self Test
SPD       Serial Presence Detect
DDR       Double Data Rate
RAS       Reliability, Availability and Serviceability

Memory Placement
Memory placement is one of the most notable physical aspects of a UCS solution.
Typically, the server ships pre-populated with the requested amount of memory. When in
doubt, refer to the hardware installation guide, which should be updated regularly as new
hardware is introduced.

For memory population rules please refer to B-series technical specifications for the specific
platform.

B-series technical specifications link:

http://www.cisco.com/c/en/us/products/servers-unified-computing/ucs-b-series-blade-servers/datasheet-listing.html

Memory Errors
● DIMM Error
  ● ECC (Error Correcting Code) Error
    Multibit = Uncorrectable
      ● POST: the DIMM is mapped out by BIOS, OS does not see the DIMM
      ● Runtime: usually causes an OS reboot
    Singlebit = Correctable
      ● OS continues to see the memory, performance could degrade
  ● Parity Error
  ● SPD (Serial Presence Detect) Error
● Configuration Error
  ● Unpaired DIMMs
  ● Mismatch errors
  ● Not supported DIMMs
  ● Not supported DIMM population
● Identity unestablishable error
  ● Check and update the catalog

Correctable vs. Uncorrectable Errors


Whether a particular error is correctable or uncorrectable depends on the strength of the ECC
code employed within the memory system. Dedicated hardware is able to fix correctable errors
when they occur with no impact on program execution.

DIMMs with correctable errors are not disabled and remain available for the OS to use. The Total
Memory and Effective Memory are the same (taking memory mirroring into account). These
correctable errors are reported in UCSM with an operability state of Degraded, while the overall
operability is Operable with correctable errors.

Uncorrectable errors generally cannot be fixed, and may make it impossible for the application or
operating system to continue execution. A DIMM with an uncorrectable error is disabled, and the OS
does not see that memory. The UCSM operState changes to "Inoperable" in this case.
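
To illustrate how the strength of the code decides what is correctable, the sketch below uses a toy extended Hamming code (4 data bits, 8 bits total) that corrects any single-bit error and detects, but cannot correct, a double-bit error. This is an assumption-laden teaching example only: real ECC DIMMs protect much wider words, and nothing here represents the actual ECC implementation used by UCS hardware. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4)
    # Data bits live at positions 3, 5, 6 and 7.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Parity bits live at positions 1, 2 and 4.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Overall parity makes the total number of set bits even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    parity_broken = sum(code) % 2 == 1
    if parity_broken:
        # Odd number of flipped bits: treat as a single-bit error and fix it.
        # A syndrome of 0 means the overall parity bit itself was flipped.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
single = list(word)
single[5] ^= 1                      # one flipped bit
double = list(single)
double[6] ^= 1                      # a second flipped bit
print(decode(word), decode(single), decode(double))
```

The single-bit case is repaired transparently, which mirrors why a DIMM with correctable errors stays available to the OS, while the double-bit case can only be reported.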

Troubleshooting DIMM’s via UCSM and CLI


To Check Errors from GUI
| DIMM Status (UCSM) | Operability (UCSM) | Logs (SEL) | Description / Comments |
|--------------------|--------------------|------------|------------------------|
| Operable | Operable | Check SEL log for DIMM related errors | A DIMM is installed and functional. |
| Operable | Degraded | Check SEL for ECC errors | A correctable ECC DIMM error is detected during run time. |
| Removed | N/A | No logs | A DIMM is not installed or has corrupted SPD data. |
| Disabled | Operable | Check SEL for Identity unestablishable errors | Check and update capability catalog. |
| Disabled | N/A | Check SEL if another DIMM failed in the same channel | A DIMM may be healthy but disabled because a configuration rule could not be maintained by a failed DIMM in the same channel. |
| Disabled | N/A | No logs | Failed to follow memory configuration rule because of missing DIMMs. |
| Inoperable | Inoperable/Replacement required | | UE ECC Error was detected. |
| Degraded | Inoperable | Check SEL for ECC errors | DIMM status and Operability changed because ECC errors were detected before the host rebooted. |
| Degraded | Inoperable/Replacement required | Check SEL for ECC error during POST/MRC | Uncorrectable ECC error was detected during run time. DIMM remains available to OS, OS crashes and comes back up but still can use this DIMM. Error can occur again later. DIMM should be replaced in most situations. |

To obtain statistics, navigate to Equipment > Chassis > Server > Inventory > Memory,
then right-click the memory and select Show Navigator.

To Check Errors from CLI


These commands are useful when troubleshooting errors from CLI.

scope server x/y -> show memory detail


scope server x/y -> show memory-array detail
scope server x/y -> scope memory-array x -> show stats history memory-array-env-stats detail
From the memory array scope, you can also access an individual DIMM.

scope server X/Y -> scope memory-array Z -> scope dimm N

From there, you can obtain per-DIMM statistics or reset the error counters.

bdsol-6248-06-B /chassis/server/memory-array/dimm # reset-errors


bdsol-6248-06-B /chassis/server/memory-array/dimm* # commit-buffer
bdsol-6248-06-B /chassis/server/memory-array/dimm # show stats memory-error-state
If you see a correctable error reported that matches the information above, the problem can be
corrected by resetting the BMC instead of reseating or resetting the blade server. Resetting the
BMC does not impact the OS running on the blade.

Use these Cisco UCS Manager CLI commands:

UCS1-A# scope server x/y


UCS1-A /chassis/server # scope bmc
UCS1-A /chassis/server/bmc # reset
UCS1-A /chassis/server/bmc* # commit-buffer
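
When the same steps have to be repeated for several blades, it can help to generate the command sequences rather than retype them. The sketch below only builds the text of the two sequences shown above (per-DIMM error reset and BMC reset); it does not connect to UCS Manager. The function names and the use of plain integers for chassis, server, memory array and DIMM IDs are assumptions made for illustration.

```python
def dimm_reset_commands(chassis, server, memory_array, dimm):
    """Commands to reset the error counters of one DIMM and show its error state."""
    return [
        f"scope server {chassis}/{server}",
        f"scope memory-array {memory_array}",
        f"scope dimm {dimm}",
        "reset-errors",
        "commit-buffer",
        "show stats memory-error-state",
    ]


def bmc_reset_commands(chassis, server):
    """Commands to reset the BMC of one blade."""
    return [
        f"scope server {chassis}/{server}",
        "scope bmc",
        "reset",
        "commit-buffer",
    ]


# Print both sequences for a hypothetical blade 1/3, memory array 1, DIMM 5.
for command in dimm_reset_commands(1, 3, 1, 5) + bmc_reset_commands(1, 3):
    print(command)
```

Review the generated lines before pasting them into a CLI session, since both sequences end with commit-buffer.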
With UCSM releases 3.1 and 2.2.7, the thresholds for corrected memory errors have been
removed. Therefore, memory modules (DIMMs) are no longer reported as "Inoperable" or "Degraded"
solely due to corrected memory errors.

As per the whitepaper http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-manager/whitepaper-c11-736116.pdf:

Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to
increased memory error rates. Traditionally, the industry has treated correctable errors in the same
way as uncorrectable errors, requiring the module to be replaced immediately upon alert. Given
extensive research that correctable errors are not correlated with uncorrectable errors, and that
correctable errors do not degrade system performance, the Cisco UCS team recommends against
immediate replacement of modules with correctable errors. Customers who experience a
Degraded memory alert for correctable errors should reset the memory error and resume
operation. Following this recommendation avoids unnecessary server disruption. Future
enhancements to error management are coming and will help distinguish among various types of
correctable errors and identify the appropriate actions, if any, needed.

It is recommended to run at least version 2.1(3c) or 2.2(1b), which include enhancements to UCS
memory error management.

If the above troubleshooting did not help, please raise a support request for assistance.

Log Files to Check in Tech Support

UCSM_X_TechSupport > sam_techsupportinfo

Provides information about DIMM and memory array.

Chassis/server tech support

CIMCX_TechSupport\tmp\CICMX_TechSupport.txt -> Generic tech support information about server X.
CIMCX_TechSupport\obfl\obfl-log -> OBFL logs provide ongoing logs about the status and boot of
server X.
CIMCX_TechSupport\var\log\sel -> SEL logs for server X.

Based on the platform/version, navigate to these files in the tech support bundle:

var/nuova/BIOS > RankMarginTest.txt

var/nuova/BIOS > MemoryHob.txt

var/nuova/BIOS > MrcOut_*.txt

These files provide information about memory as seen from BIOS level.

The information there can be cross-referenced against the DIMM state reporting table shown above.

Example:

/var/nuova/BIOS/RankMarginTest.txt

● Useful for showing the test results from BIOS
  ● Training test
  ● MEMBIST
● Look for errors
● Look to see if any DIMMs are mapped out
● Shows DIMM specific information (Vendor/speed/PID)

DIMM |GB|R|MfgDate|Mod ID |DRAM ID |Reg ID |CtW Tck CLS Taa V|Freq|Part#
A1 18| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
A2 26| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
B1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
B2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
C1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
C2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
D1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
D2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
E1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
E2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
F1 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
F2 01| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9
The first column holds two values:

● DIMM locator (for example, F2)
● DIMM status (for example, 01)
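
As a minimal sketch of reading such a row programmatically, the function below splits one line on the `|` separators and pulls the locator and status out of the first column. It assumes the row layout matches the sample above exactly (ten `|`-separated columns, status written as two hex digits); the function name and the dictionary keys are hypothetical.

```python
def parse_dimm_row(line):
    """Parse one DIMM row from the RankMarginTest.txt listing shown above."""
    fields = [field.strip() for field in line.split("|")]
    # First column: "<locator> <status>", for example "A1 18".
    locator, status = fields[0].split()
    return {
        "locator": locator,
        "status": int(status, 16),   # status codes are hexadecimal
        "size_gb": int(fields[1]),
        "freq_mhz": int(fields[-2]),
        "part_number": fields[-1],
    }


row = parse_dimm_row(
    "A1 18| 8|2|2009W48|Samsung|Samsung 00|Inphi 03|5550 0C 003C 69 0|1333|M393B1K70BH1-CH9"
)
print(row["locator"], hex(row["status"]), row["part_number"])
```

Any row whose status is not 01 is worth checking against the status descriptions that follow.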

Here is a brief description for each status:

0x00 // Not Installed (No DIMM)

0x01 // Installed (Working)

//// 0x02-0F (Reserved)

//// Failed

0x10 // Failed Training

0x11 // Failed Clock Training

//// 0x12-17 (Reserved)

0x18 // Failed MemBIST

//// 0x19-1F (Reserved)

//// Ignored

0x20 // Ignored (Disabled from debug console)

0x21 // Ignored (SPD Error reported by BMC)

0x22 // Ignored (Non-RDIMM)

0x23 // Ignored (Non-ECC)

0x24 // Ignored (Non-x4)

0x25 // Ignored (Other PDIMM in same LDIMM failed)

0x26 // Ignored (Other LDIMM in same channel failed)

0x27 // Ignored (Other channel in LockStep or Mirror failed)

0x28 // Ignored (Invalid PDIMM population)

0x29 // Ignored (PDIMM Organization Mismatch)

0x2A // Ignored (PDIMM Register Vendor Mismatch)

//// 0x2B-7F (Reserved)
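
The status list above can be turned into a small lookup, sketched below. The dictionary reproduces only the codes listed in this section; codes in the 0x00-0x7F range that are not listed are reported as reserved, and anything above 0x7F is reported as not covered by this list. The names `DIMM_STATUS` and `describe_status` are hypothetical.

```python
DIMM_STATUS = {
    0x00: "Not Installed (No DIMM)",
    0x01: "Installed (Working)",
    0x10: "Failed Training",
    0x11: "Failed Clock Training",
    0x18: "Failed MemBIST",
    0x20: "Ignored (Disabled from debug console)",
    0x21: "Ignored (SPD Error reported by BMC)",
    0x22: "Ignored (Non-RDIMM)",
    0x23: "Ignored (Non-ECC)",
    0x24: "Ignored (Non-x4)",
    0x25: "Ignored (Other PDIMM in same LDIMM failed)",
    0x26: "Ignored (Other LDIMM in same channel failed)",
    0x27: "Ignored (Other channel in LockStep or Mirror failed)",
    0x28: "Ignored (Invalid PDIMM population)",
    0x29: "Ignored (PDIMM Organization Mismatch)",
    0x2A: "Ignored (PDIMM Register Vendor Mismatch)",
}


def describe_status(code):
    """Translate a numeric DIMM status into the description listed above."""
    if code in DIMM_STATUS:
        return DIMM_STATUS[code]
    if 0x00 <= code <= 0x7F:
        return "Reserved"
    return "Not covered by this list"


# In the sample listing above, A1 reports 18 and A2 reports 26.
for locator, code in (("A1", 0x18), ("A2", 0x26)):
    print(locator, describe_status(code))
```

Read together, the two results suggest that A2 is ignored as a consequence of the MemBIST failure on A1 in the same channel.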

var/nuova/BIOS > MemoryHob.txt

Shows the effective and failed memory installed on the server.


+++ BEGINNING OF FILE
Memory Speed = 1067 MHz
Memory Mode = 00
RAS Modes = 03
MRC Flags = 0000000A
Total Memory = 98304 MB
Effective Memory = 90112 MB
Failed Memory = 8192 MB
Ignored Memory = 0 MB
Redundant Memory = 0 MB
|---------------------------------|
| Memory | Channel | DIMM Status |
| Channel | Status | 1 2 |
|---------------------------------|
| A | 01 | 01 01 |
| B | 01 | 01 01 |
| C | 01 | 01 01 |
| D | 01 | 01 01 |
| E | 01 | 01 01 |
| F | 01 | 01 18 |
|---------------------------------|
18h - The DIMM status is marked as failed when the DIMM fails the MemBIST test. Replace it with a
known good DIMM.

DIMM Status Description

00h Not Installed (No DIMM)

01h Installed (Working)

02h-0Fh Reserved

10h Failed (Training)

11h Failed (Clock training)

12h-17h Reserved

18h Failed (MemBIST)

19h-1Fh Reserved

20h Ignored (Disabled from debug console)

21h Ignored (SPD Error reported by BMC)

22h Ignored (Non-RDIMM)

23h Ignored (Non-ECC)

24h Ignored (Non-x4)

25h Ignored (Other PDIMM in same LDIMM failed)

26h Ignored (Other LDIMM in same channel failed)


27h Ignored (Other channel in LockStep or Mirror)

28h Ignored (Invalid memory population)

29h Ignored (Organization mismatch)

2Ah Ignored (Register vendor mismatch)

2Bh-7Fh Reserved

80h Ignored (Workaround - Looping)

81h Ignored (Stuck I2C bus)

82h-FFh Reserved
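
A MemoryHob.txt file in the layout shown above can be scanned with a short script. The sketch below collects the memory totals and flags every DIMM whose status is neither 00 nor 01. It assumes the file follows the sample layout exactly, and it names each DIMM as channel letter plus column number (for example F2), which matches the locators in RankMarginTest.txt but is still an assumption. All function and variable names are hypothetical.

```python
import re

SUMMARY = re.compile(
    r"^\s*(Total|Effective|Failed|Ignored|Redundant) Memory\s*=\s*(\d+) MB"
)
# Matches rows such as "| F | 01 | 01 18 |" but not the header rows.
ROW = re.compile(
    r"^\s*\|\s*([A-Z])\s*\|\s*([0-9A-Fa-f]{2})\s*\|\s*((?:[0-9A-Fa-f]{2}\s+)+)\|"
)


def parse_memory_hob(text):
    """Return (summary in MB, {locator: status}) from MemoryHob.txt content."""
    summary, dimms = {}, {}
    for line in text.splitlines():
        match = SUMMARY.match(line)
        if match:
            summary[match.group(1).lower()] = int(match.group(2))
            continue
        match = ROW.match(line)
        if match:
            channel = match.group(1)
            for slot, code in enumerate(match.group(3).split(), start=1):
                dimms[f"{channel}{slot}"] = int(code, 16)
    return summary, dimms


SAMPLE = """\
Total Memory = 98304 MB
Effective Memory = 90112 MB
Failed Memory = 8192 MB
Ignored Memory = 0 MB
| Memory | Channel | DIMM Status |
| Channel | Status | 1 2 |
| A | 01 | 01 01 |
| F | 01 | 01 18 |
"""

summary, dimms = parse_memory_hob(SAMPLE)
suspects = sorted(loc for loc, code in dimms.items() if code not in (0x00, 0x01))
print(summary["failed"], suspects)  # prints: 8192 ['F2']
```

In the sample, the 8192 MB of failed memory corresponds to the single DIMM reporting 18h.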

DIMM Blacklisting
In Cisco UCS Manager, the state of the Dual In-line Memory Module (DIMM) is based on SEL
event records. When the BIOS encounters a noncorrectable memory error during memory test
execution, the DIMM is marked as faulty. A faulty DIMM is considered a nonfunctional device.

If you enable DIMM blacklisting, Cisco UCS Manager monitors the memory test execution
messages and blacklists any DIMMs that encounter memory errors in the DIMM SPD data. This
allows the host to map out any DIMMs that encounter uncorrectable ECC errors.

DIMM Blacklisting was introduced as an optional global policy in UCSM 2.2(2).

Server firmware must be 2.2(1)+ for B-series blades and 2.2(3)+ for C-series rack servers to
properly implement this feature.

In UCSM 2.2(4), DIMM Blacklisting is enabled by default.

Open the tech support file …/var/log/DimmBL.log

Open the file /var/nuova/BIOS/MrcOut.txt if it is available

Find the DIMM Status table. Look for “DIMM Status:”

DIMM Blacklisted = 1E

DIMM Status:

00 - Not Installed

01 - Installed

10 - Failed (Training failure)


1E - Failed (DIMM Blacklisted by BMC)

1F - Failed (SPD Error)

25 - Disabled (Other DIMM failed in same channel)

Example

DIMM Status:

|=======================|
| Memory | DIMM Status |
| Channel | 1 2 3 |
|=======================|
| A | 25 1F 25 |
| B | 01 01 01 |
| C | 1F 25 25 |
| D | 01 01 01 |
| E | 01 01 01 |
| F | 25 25 1E |
| G | 01 01 01 |
| H | 01 01 01 |
|=======================|

DIMM Status:

01 - Installed

1E - Failed (DIMM Blacklisted by BMC)

1F - Failed (SPD Error)

25 - Disabled (Other DIMM failed in same channel)
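
Finding the blacklisted DIMMs in a DIMM Status table like the one above can be scripted. The sketch below looks for a given status code (1E for a blacklisted DIMM) and returns the matching positions. It assumes the table layout of the example, and it labels each DIMM as channel letter plus column number, which is an assumption about naming. The function name is hypothetical.

```python
import re

# Matches data rows such as "| F | 25 25 1E |" but not the two header rows.
ROW = re.compile(r"^\s*\|\s*([A-Z])\s*\|\s*((?:[0-9A-Fa-f]{2}\s+)+)\|\s*$")


def dimms_with_status(text, wanted):
    """Return the DIMM positions whose status equals the wanted code."""
    hits = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        for slot, code in enumerate(match.group(2).split(), start=1):
            if int(code, 16) == wanted:
                hits.append(f"{match.group(1)}{slot}")
    return hits


SAMPLE = """\
| Memory | DIMM Status |
| Channel | 1 2 3 |
| A | 25 1F 25 |
| B | 01 01 01 |
| C | 1F 25 25 |
| F | 25 25 1E |
"""

print(dimms_with_status(SAMPLE, 0x1E))  # prints: ['F3']
print(dimms_with_status(SAMPLE, 0x1F))  # prints: ['A2', 'C1']
```

The remaining DIMMs in channels A, C and F report 25, which the legend describes as disabled because another DIMM failed in the same channel.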

Methods to Clear DIMM Blacklisting Errors


UCSM GUI
UCSM CLI

UCS-B/chassis/server # reset-all-memory-errors

Related Information
● http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/ts/guide_old_FM/TS_Server.html
● http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-2/b_UCSM_GUI_Configuration_Guide_2_2/configuring_server_related_policies.html#concept_2069B1145AAB47638CF9AFBB12198CEF
● https://www.cisco.com/c/dam/en/us/support/docs/servers-unified-computing/ucs-b-series-blade-servers/CiscoUCSEnhancedMemoryErrorManagementTechNoteFeb42015.pdf
● http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/ts/guide_old_FM/TS_Server.html#wp1073848
● http://www.cisco.com/c/en/us/support/docs/field-notices/636/fn63651.html

Notable Bugs
Cisco Bug ID CSCug93076 B200M3-DDR voltage regulator may have excessive noise under light
load

Cisco Bug ID CSCup07488 IPMI DIMM fault sensor is setting Dimm Degraded with no error count.

Cisco Bug ID CSCud22620 Improved accuracy at identifying Degraded DIMMs

Cisco Bug ID CSCuw44524 C460M4, B260M4 or B460M4 IVB clear CMOS can cause memory
UECC Error

Cisco Bug ID CSCur19705 ECC/UECC Errors observed on B200M3
