Aquilion PRIME - Start-Up Does Not Complete

Uploaded by

spahiuana
20110719

Aquilion PRIME - Start-up does not complete


V4.71ER004

Phenomenon
During installation of a new Aquilion PRIME, the system fails to complete start-up most of the
time. When start-up does complete, the system occasionally hangs during operation.

Figure 1 The system may not complete the start-up

Observations and interventions


Switching the power OFF and turning it ON again after about five minutes usually makes the system
start up normally. However, the problem returns within a few hours.

The service manual has only limited instructions for faultfinding the console section. The results of
HWtest and HWdiag are unknown.

Inspections done:
- Checked LEDs at the front of the four servers: OK
- Checked LEDs at the back of the servers: seemed normal
- Checked RDD LCD panel: status "Ready"
- Checked LEDs at the front of the RDD: all normal
- Checked LEDs at the back of the RDD controllers: OK
- Collected logs, screenshots and RTM server boot recording, for reporting purposes
- Checked IP addresses at Recon box with "Colasoft": all OK
- When the problem is present, the blue Infiniband connection lights flash on and off
- The SS/ADIS has no heartbeat

Parts replaced, without change to the phenomenon:
- One new server, installed in each slot in turn
- FC card in Server and S-con
- Infiniband card in ALL server locations
- SPIF board

Support was sought, logs were sent. Support made the following recommendations:
- Engineering needs to review the logs sent
- Check the IPMI card in RTM/RECON1
- Check the LAN connection of RTM/RECON1
- Check the network switch in the REC box
- Check or replace the FC card of RTM/RECON1
- Check or replace the server RTM/RECON1

The errlog.RTMsummary shows some abnormalities (Figure 2). All servers seem to have the
following three errors:

- StartupTask initialization failed
- DCP initialization failed
- RocketLink driver initialization failure

Engineering had the logs reviewed by TMRU. Suspicion falls on the Infiniband environment. It is
recommended to check the Infiniband cables, all servers and the Infiniband switch, and to replace
the Infiniband switch if possible. See the appendix for details on Engineering's analysis.

- Replacement of the Infiniband switch made no difference
- Replacement of the network switch / hub made no difference either
- All cable connections were checked and rechecked; nothing obvious was found
- Replacement of the SS/ADI board made no difference

The problem persists. The system appears to start up successfully in only two out of ten
attempts (20% success).

As all faultfinding means seemed to have been exhausted, and as the pressure to complete the
installation of the system on time was rising, it was decided to replace the entire Rec box. After
this replacement, the problem no longer occurred. The system was restarted at least eight times
without any abnormalities.
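
The weight of eight clean restarts can be estimated with a small calculation. The sketch below is
illustrative only: it takes the observed 2-out-of-10 success rate as the per-attempt probability
for a still-faulty system and assumes that start-up attempts are independent, which the report
does not state.

```python
# Illustrative estimate, not part of the original report.
# Assumption: each start-up attempt of a still-faulty system succeeds
# independently with the observed rate of 2 out of 10.
p_success_if_faulty = 2 / 10
clean_restarts = 8  # "at least eight" restarts were observed, so 8 is a lower bound

# Probability that a still-faulty system would pass all restarts by chance.
p_all_pass_by_chance = p_success_if_faulty ** clean_restarts

print(f"{p_all_pass_by_chance:.2e}")  # prints 2.56e-06
```

Under these assumptions, eight consecutive successful start-ups would be very unlikely if the
fault were still present in the system.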

Solution
Replacement of the Rec box.

Notes
This solution was chosen after exhausting all available faultfinding scenarios and due to time
pressure to complete the installation at the site. It was decided to further try to isolate the defective
component at the local head office, where an Aquilion PRIME system could be made available for
testing purposes.
Results:
It proved possible to reproduce the problem with the original Rec box installed in the system
at the local head office. Further inspection revealed that the Infiniband cable (CRS16) between
the RTM server and the Infiniband switch was causing the problems. The sealing on one of its
connectors looks remarkably abnormal (Figure 3).

Server 0
--------
Error Code # 0 >
-----------------------------------------
- TMRU Error (0x9b02c000)
- SW_Entity = Startup Task
- ErrorCode = 0x00000002
- ErrorDescription = StartupTask initialization failed
- FirstSubsystem = CPU Server
- FirstServer = 0
-----------------------------------------

Error Code # 1 >
-----------------------------------------
- TMRU Error (0xa402c000)
- SW_Entity = Data Control Process
- ErrorCode = 0x00000002
- ErrorDescription = DCP initialization failed
- FirstSubsystem = CPU Server
- FirstServer = 0
-----------------------------------------

Error Code # 2 >
-----------------------------------------
- TMRU Error (0xa102c000)
- SW_Entity = RocketLink Driver
- ErrorCode = 0x00000002
- ErrorDescription = RocketLink driver initialization failure
- FirstSubsystem = CPU Server
- FirstServer = 0
-----------------------------------------

Error # 0 > 000000454409.484 : TaskStarter : TaskStarte~omm.c(101) : __ERROR__:
Received reply with 0x9b02c100 status

Error # 1 > 000000454409.565 : TaskStarter : TaskStarte~omm.c(101) : __ERROR__:
Received reply with 0x9b02c200 status

Error # 2 > 000000454546.044 : TaskStarter : TaskStarte~omm.c(101) : __ERROR__:
Received reply with 0x9b02c300 status

Error # 3 > 000000461725.221 : StartupTask : StartupTask.c (921) : __ERROR__:
Timeout expired (300000 ms) out waiting for Command=00000f04 from process DCP_Process_0

Error # 4 > 000001421276.923 : RL_Process_0 : SPIF_RLD_Main.c (66) : __ERROR__:
Error - Unexpected message type received (4209) while waiting for CMD_INIT message

Error # 5 > 000001446272.548 : TaskStarter : MsgUtil.c (1044): __ERROR__:
Timed Out waiting for message 0x00004103 on QueueID=UnnamedQueue_00001160 (00001160)

Warning # 0 > 000000275871.096 : MCP_Process_0 : MCP_tx_thread.c (102) : __WARNING__:
MCP TX: MLL_QueueGetBoardNum(DestID=0x00000081, SrcID=0x00000081, Cmd=0x00010000) failed,
other_board_num: 0, my_board_num: 0, msg will be skipped

Warning # 1 > 000001417008.483 : MCP_Process_0 : MCP_tx_thread.c (102) : __WARNING__:
MCP TX: MLL_QueueGetBoardNum(DestID=0x0000008a, SrcID=0x0000008a, Cmd=0x00000001) failed,
other_board_num: 0, my_board_num: 0, msg will be skipped

Warning # 2 > 000001419008.532 : MCP_Process_0 : MCP_tx_thread.c (102) : __WARNING__:
MCP TX: MLL_QueueGetBoardNum(DestID=0x0000008a, SrcID=0x0000008a, Cmd=0x0000140c) failed,
other_board_num: 0, my_board_num: 0, msg will be skipped

Warning # 3 > 000001421231.877 : TaskStarter : PTL_Event.c (535) : __WARNING__:
Possibly found only one entry, no events will be dumped

Warning # 4 > 000001421234.849 : TaskStarter : PTL_Event.c (535) : __WARNING__:
Possibly found only one entry, no events will be dumped

Warning # 5 > 000001446272.563 : TaskStarter : TaskStarte~ges.c(1265): __WARNING__:
Unable to get status from Startup Task, return status: 0x823f0000

Figure 2 Modified abstract of errlog.RTMsummary
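
To compare the error blocks of several servers, a summary in the layout of Figure 2 can be
condensed into one line per error. The following is a minimal sketch, assuming the summary keeps
the "- Name = Value" layout shown above; the function name parse_error_blocks and the embedded
sample are illustrative and not part of any service tool.

```python
import re

# Sample abridged from the "Server 0" section of Figure 2.
SAMPLE = """\
Server 0
--------
Error Code # 0 >
-----------------------------------------
- TMRU Error (0x9b02c000)
- SW_Entity = Startup Task
- ErrorCode = 0x00000002
- ErrorDescription = StartupTask initialization failed
- FirstSubsystem = CPU Server
- FirstServer = 0
-----------------------------------------
Error Code # 1 >
-----------------------------------------
- TMRU Error (0xa402c000)
- SW_Entity = Data Control Process
- ErrorCode = 0x00000002
- ErrorDescription = DCP initialization failed
- FirstSubsystem = CPU Server
- FirstServer = 0
-----------------------------------------
"""

TMRU_RE = re.compile(r"^- TMRU Error \((0x[0-9a-fA-F]+)\)")
FIELD_RE = re.compile(r"^- (\w+) = (.*)$")


def parse_error_blocks(text):
    """Return one dict per 'TMRU Error' block found in the text (hypothetical helper)."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = TMRU_RE.match(line)
        if match:
            # A "TMRU Error" line opens a new block.
            current = {"TMRU": match.group(1)}
            blocks.append(current)
            continue
        match = FIELD_RE.match(line)
        if match and current is not None:
            # "- Name = Value" lines belong to the most recently opened block.
            current[match.group(1)] = match.group(2).strip()
    return blocks


errors = parse_error_blocks(SAMPLE)
for error in errors:
    print(error["TMRU"], error["SW_Entity"], "->", error["ErrorDescription"])
```

Running the same function over the section of each server makes it easy to confirm whether all
servers report the same set of errors.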


Figure 3 Remarkably abnormal sealing of one of the Infiniband cable connectors

Open ends
- Decent faultfinding scenarios are not available in the service manual

Appendix: Engineering's analysis of the logs

Check the files Print_0.log or BootP_0.log first. The following information may be found:

000000166808.382 : NON-PTL-PRINTF: : -------------------------------------------------
000000166808.429 : NON-PTL-PRINTF: : OpenSM 3.2.1-0bc7db2
000000166808.473 : NON-PTL-PRINTF: : Command Line Arguments:
000000166808.539 : NON-PTL-PRINTF: : Log file max size is 5242880 bytes
000000166808.596 : NON-PTL-PRINTF: : Log File: /tmp/opensm.log
000000166808.647 : NON-PTL-PRINTF: : -------------------------------------------------
000000166809.478 : NON-PTL-PRINTF: : OpenSM 3.2.1-0bc7db2
000000166809.553 : NON-PTL-PRINTF: :
000000166810.537 : NON-PTL-PRINTF: : Entering DISCOVERING state
000000166810.663 : NON-PTL-PRINTF: :
000000166868.221 : NON-PTL-PRINTF: : Using default GUID 0x2c902002a2085
000000278853.151 : MCP_Process_0 : MCP_tx_thread.c (102) : __WARNING__: MCP TX:
MLL_QueueGetBoardNum(DestID=0x00000081, SrcID=0x00000081, Cmd=0x00010000) failed, other_board_num: 0,
my_board_num: 0, msg will be skipped
000000457393.461 : TaskStarter : PTL_Lib.c (5363): PTL_ERROR_REPORT
CALLER LOCATION : TaskStarterProcess.c (465)
PTL LOCATION : PTL_Lib.c (3107)
PTL_ERROR : Receive Message Error: Timeout
SYS_ERROR : Rx Message Queue error - receive timeout
OS_ERROR : (EOK) - No Error

This indicates that a problem occurred in the MCP process (MCP_Process_0 in the log).

MCP stands for Message Control Process, the process responsible for communication between
servers and within servers. This communication runs over Infiniband.
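
The first check can be partly automated by filtering a log for warning and error entries. The
sketch below assumes the "timestamp : process : message" layout of the excerpt above; the helper
name scan and the abridged sample are illustrative only.

```python
import re

# Lines abridged from the Print_0.log excerpt above.
SAMPLE = """\
000000166810.537 : NON-PTL-PRINTF: : Entering DISCOVERING state
000000278853.151 : MCP_Process_0 : MCP_tx_thread.c (102) : __WARNING__: MCP TX:
000000457393.461 : TaskStarter : PTL_Lib.c (5363): PTL_ERROR_REPORT
PTL_ERROR : Receive Message Error: Timeout
"""

# timestamp, process name, then the rest of the entry
LINE_RE = re.compile(r"^(\d+\.\d+) : (\S+)\s*: (.*)$")
MARKERS = ("__WARNING__", "__ERROR__", "PTL_ERROR_REPORT")


def scan(text):
    """Return (timestamp, process, message) for every flagged entry (hypothetical helper)."""
    findings = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # continuation lines carry no timestamp
        timestamp, process, message = match.groups()
        if any(marker in message for marker in MARKERS):
            findings.append((float(timestamp), process, message))
    return findings


findings = scan(SAMPLE)
for timestamp, process, message in findings:
    print(timestamp, process, message)
```

In this sample the filter keeps the MCP_Process_0 warning and the TaskStarter error report, the
two entries that the analysis above relies on.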

Check the opensm.log next (comparison between the normal situation and the problem case in the UK):

<NORMAL CASE>
May 25 08:25:34 988820 [0001] 0x03 -> OpenSM 3.2.1-0bc7db2
May 25 08:25:34 988820 [0001] 0x80 -> OpenSM 3.2.1-0bc7db2
May 25 08:25:34 988820 [0001] 0x02 -> osm_vendor_init: 1000 pending umads specified
May 25 08:25:34 989820 [0001] 0x80 -> Entering DISCOVERING state
May 25 08:25:35 045811 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129
May 25 08:25:35 215785 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129
May 25 08:25:35 231783 [000C] 0x80 -> Entering MASTER state

<PROBLEM CASE in UK>
Jul 01 12:48:43 335368 [0001] 0x03 -> OpenSM 3.2.1-0bc7db2
Jul 01 12:48:43 335368 [0001] 0x80 -> OpenSM 3.2.1-0bc7db2
Jul 01 12:48:43 336368 [0001] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jul 01 12:48:43 336368 [0001] 0x80 -> Entering DISCOVERING state
(No information is here)

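The last three lines of the normal case (the two osm_vendor_bind entries and the "Entering MASTER"
entry, marked red in the original document) are missing in the file from the UK, meaning that
OpenSM did not work. It is not clear from the log which hardware was abnormal at that time.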
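
The comparison can be expressed as a small check on an opensm.log file. This is a sketch under
the assumption that the log keeps the "... -> message" layout of the two excerpts; opensm_status
is a hypothetical helper name, and the samples are abridged from the listings above. Prefix
matching is used so that a truncated final word does not affect the result.

```python
# Abridged from the <NORMAL CASE> excerpt.
NORMAL = """\
May 25 08:25:34 988820 [0001] 0x03 -> OpenSM 3.2.1-0bc7db2
May 25 08:25:34 989820 [0001] 0x80 -> Entering DISCOVERING state
May 25 08:25:35 045811 [0001] 0x02 -> osm_vendor_bind: Binding to port 0x2c902002a4129
May 25 08:25:35 231783 [000C] 0x80 -> Entering MASTER state
"""

# Abridged from the <PROBLEM CASE in UK> excerpt.
PROBLEM = """\
Jul 01 12:48:43 335368 [0001] 0x03 -> OpenSM 3.2.1-0bc7db2
Jul 01 12:48:43 336368 [0001] 0x80 -> Entering DISCOVERING state
"""


def opensm_status(text):
    """Report which start-up milestones appear in an opensm.log excerpt (hypothetical helper)."""
    messages = [
        line.split("->", 1)[1].strip()
        for line in text.splitlines()
        if "->" in line
    ]
    return {
        "discovering": any(m.startswith("Entering DISCOVERING") for m in messages),
        "port_bound": any(m.startswith("osm_vendor_bind:") for m in messages),
        "master": any(m.startswith("Entering MASTER") for m in messages),
    }


normal_status = opensm_status(NORMAL)
problem_status = opensm_status(PROBLEM)
print("normal :", normal_status)
print("problem:", problem_status)
```

A log that reaches the discovering milestone but never shows a port binding or the MASTER entry
matches the problem case described here.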

OpenSM is the "Subnet Manager" software for the Infiniband network. This software runs in the
RTM. It is required because the Infiniband switch in the current product (PRIME) does not have
a built-in "management" function.

The Infiniband switch originally used in Aquilion ONE had a built-in management function; i.e. it
was a "managed" switch.

Infiniband requires the management function to control the communication routes between the
servers.
