Aquilion PRIME - Start-Up Does Not Complete
Phenomenon
During installation of this new Aquilion PRIME, the system was observed to fail to complete
start-up most of the time. When start-up does complete, the system occasionally hangs during
operation.
The service manual has only limited instructions for faultfinding the console section. The results of
HWtest and HWdiag are unknown.
Inspections done:
- Checked LEDs at the front of the four servers: OK
- Checked LEDs at the back of the servers: seemed normal
- Checked RDD LCD panel: status "Ready"
- Checked LEDs at the front of the RDD: all normal
- Checked LEDs at the back of the RDD controllers: OK
- Collected logs, screenshots and RTM server boot recording, for reporting purposes
- Checked IP addresses at Recon box with "Colasoft": all OK
- When the problem is present, the blue Infiniband connection lights flash on and off
- The SS/ADIS has no heartbeat.
Support was sought, logs were sent. Support made the following recommendations:
- Engineering needs to review the logs sent
- Check the IPMI card in RTM/RECON1
- Check the LAN connection of RTM/RECON1
- Check the network switch in the Rec box
- Check or replace the FC card of RTM/RECON1
- Check or replace the server RTM/RECON1
The errlog.RTMsummary shows some abnormalities (figure 2). All servers seem to have the
following three errors:
Engineering had the logs reviewed by TMRU. Suspicion falls on the Infiniband environment. It is
recommended to check the Infiniband cables, all servers and the Infiniband switch, and to replace
the Infiniband switch if possible. See the appendix for details on Engineering's analysis.
The problem persists. The system appears to start up successfully in only two out of ten
attempts (20% success).
As all faultfinding means seemed to have been exhausted, and as the pressure to complete the
installation of the system on time was rising, it was decided to replace the entire Rec box. After
replacement of the entire Rec box, the problem no longer occurred. The system was restarted at
least eight times without any abnormalities.
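As a rough plausibility check of the restart test, the short calculation below estimates how likely eight clean start-ups in a row would be if the fault were still present. It is only a sketch: it assumes each start-up succeeds independently at the observed rate of two out of ten, which is itself based on a small sample, and the variable names are illustrative.

```python
# Observed before the Rec box was replaced: 2 successful start-ups in 10 attempts.
observed_success_rate = 2 / 10

# Observed after the replacement: at least 8 consecutive clean start-ups.
clean_restarts = 8

# Assumption: attempts are independent and the fault is unchanged.
# The chance of 8 successes in a row is then the success rate to the 8th power.
p_all_clean_by_chance = observed_success_rate ** clean_restarts

print(f"{p_all_clean_by_chance:.2e}")  # prints 2.56e-06
```

Under these assumptions, a run of eight clean start-ups would be very unlikely to occur by chance, which supports the conclusion that the fault left the system together with the original Rec box.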
Solution
Replacement of the Rec box.
Notes
This solution was chosen after exhausting all available faultfinding scenarios and due to time
pressure to complete the installation at the site. It was decided to continue trying to isolate the
defective component at the local head office, where an Aquilion PRIME system could be made
available for testing purposes.
Results:
The problem could be reproduced with the original Rec box installed in the system at the local
head office. Further inspections revealed that the Infiniband cable (CRS16) between the RTM
server and the Infiniband switch was causing the problems. The sealing on one connector looks
visibly abnormal (figure 3).
Server 0
--------
Error Code # 0 >
-----------------------------------------
- TMRU Error (0x9b02c000)
- SW_Entity = Startup Task
- ErrorCode = 0x00000002
- ErrorDescription = StartupTask initialization failed
- FirstSubsystem = CPU Server
- FirstServer = 0
-----------------------------------------
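The error entry above follows a simple "- key = value" layout, so entries from several servers can be compared programmatically. The sketch below parses one such entry into a dictionary. The function name and the idea of storing the header code under the key "TMRU Error" are illustrative assumptions, and the parser is based only on the single entry shown here, not on a documented file format.

```python
import re

# Matches the header line of an entry, e.g. "- TMRU Error (0x9b02c000)".
HEADER_PATTERN = re.compile(r"^- TMRU Error \((0x[0-9a-fA-F]+)\)$")


def parse_error_entry(lines):
    """Parse the '- key = value' lines of one error entry into a dict.

    Separator lines and anything else that does not match are skipped.
    """
    fields = {}
    for raw in lines:
        line = raw.strip()
        header = HEADER_PATTERN.match(line)
        if header:
            fields["TMRU Error"] = header.group(1)
        elif line.startswith("- ") and " = " in line:
            key, value = line[2:].split(" = ", 1)
            fields[key.strip()] = value.strip()
    return fields


# The entry for Server 0, copied from the excerpt above.
SERVER_0_ENTRY = [
    "-----------------------------------------",
    "- TMRU Error (0x9b02c000)",
    "- SW_Entity = Startup Task",
    "- ErrorCode = 0x00000002",
    "- ErrorDescription = StartupTask initialization failed",
    "- FirstSubsystem = CPU Server",
    "- FirstServer = 0",
    "-----------------------------------------",
]

if __name__ == "__main__":
    for key, value in parse_error_entry(SERVER_0_ENTRY).items():
        print(f"{key}: {value}")
```

Parsing every server's entries this way would make it easy to confirm whether all servers really report the same error codes.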
Open ends
- Adequate faultfinding scenarios are not available in the service manual
Appendix: Engineering's analysis of the logs
Checked the opensm.log next (comparison between normal situation and problem case in UK):
<NORMAL CASE>
May 25 08:25:34 988820 [0001] 0x03 -> OpenSM 3.2.1-0bc7db2
May 25 08:25:34 988820 [0001] 0x80 -> OpenSM 3.2.1-0bc7db2
May 25 08:25:34 988820 [0001] 0x02 -> osm_vendor_init: 1000 pending umads specified
May 25 08:25:34 989820 [0001] 0x80 -> Entering DISCOVERING state
May 25 08:25:35 045811 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129
May 25 08:25:35 215785 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129
May 25 08:25:35 231783 [000C] 0x80 -> Entering MASTER stat
The last three lines of the normal case (marked red in the original report) are missing in the
file from the UK, meaning that OpenSM did not work. It is not clear which hardware is abnormal at
that time.
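The comparison described above can be sketched as a small check that reports which start-up milestones appear in an opensm.log. The marker strings are copied from the normal-case excerpt. The function name is illustrative, and the problem case is only a stand-in built from the description (the normal excerpt without its last three lines), not the actual log from the UK.

```python
# Markers copied from the normal-case excerpt. Matching only the prefix
# "Entering MASTER" is an assumption, so the check does not depend on how
# the rest of that line is worded.
DISCOVERING_MARKER = "Entering DISCOVERING state"
BIND_MARKER = "osm_vendor_bind: Binding to port"
MASTER_MARKER = "Entering MASTER"


def summarize_opensm_log(lines):
    """Report which start-up milestones appear in opensm.log lines."""
    lines = list(lines)
    return {
        "discovering": any(DISCOVERING_MARKER in line for line in lines),
        "port_binds": sum(BIND_MARKER in line for line in lines),
        "master": any(MASTER_MARKER in line for line in lines),
    }


NORMAL_CASE = [
    "May 25 08:25:34 988820 [0001] 0x03 -> OpenSM 3.2.1-0bc7db2",
    "May 25 08:25:34 988820 [0001] 0x80 -> OpenSM 3.2.1-0bc7db2",
    "May 25 08:25:34 988820 [0001] 0x02 -> osm_vendor_init: 1000 pending umads specified",
    "May 25 08:25:34 989820 [0001] 0x80 -> Entering DISCOVERING state",
    "May 25 08:25:35 045811 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129",
    "May 25 08:25:35 215785 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129",
    "May 25 08:25:35 231783 [000C] 0x80 -> Entering MASTER stat",
]

# Stand-in for the problem case: the last three lines are missing.
PROBLEM_CASE = NORMAL_CASE[:-3]

if __name__ == "__main__":
    print("normal: ", summarize_opensm_log(NORMAL_CASE))
    print("problem:", summarize_opensm_log(PROBLEM_CASE))
```

A log that reaches the discovering state but shows no port binds and no master state matches the pattern reported for the problem case.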
OpenSM is the "Subnet Manager" software for the Infiniband network. It runs in the RTM and is
required because the Infiniband switch in the current product (PRIME) does not have a built-in
"management" function.
The originally used Infiniband switch in Aquilion ONE had a built-in management function; i.e. it
was a "managed" switch.
The management function is required for Infiniband to control the communication routes between
the servers.