CB 960 Offline Online RMA


Subject:RE: Tech SR: 2015-0429-0357, CM-PB-BJ:BJWJPCRT02:ALL CB CHECK STATES

The device has logged the following errors and alarms:

BJWJ-PC-CMNET-RT02-RE0> show chassis alarms


6 alarms currently active
Alarm time Class Description
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 1
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 0
2014-11-24 15:32:05 HKT Minor Check CB 0 Fabric Chip 1
2014-11-24 15:32:05 HKT Minor Check CB 0 Fabric Chip 0
2014-11-24 15:24:08 HKT Minor CB 2 Fabric Chip 1 Not Online
2014-11-24 15:24:06 HKT Minor CB 2 Fabric Chip 0 Not Online

BJWJ-PC-CMNET-RT02

Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc0 MIC0 IX PCI Fatal Error detected.


Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc0 Ixchip(0): pio_handle(0x4e4ee458);
pio_read_u32() failed: 20(input/output error)! trin_hostif-addr=000000b0
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc0 ix_isr: Failed to read int_enable
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc2 MIC0 IX PCI Fatal Error detected.
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc2 Ixchip(0): pio_handle(0x4e4ee458);
pio_read_u32() failed: 20(input/output error)! trin_hostif-addr=000000b0
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc2 ix_isr: Failed to read int_enable
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc0 Ixchip(0): pio_handle(0x4e4ee458);
pio_read_u32() failed: 1(generic failure)! trin_hostif-addr=000000b4
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc2 Ixchip(0): pio_handle(0x4e4ee458);
pio_read_u32() failed: 1(generic failure)! trin_hostif-addr=000000b4
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc4 PQ3_IIC(RD): bus arbitration lost on
byte 15
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc4 PQ3_IIC(RD): transfer not complete on
byte 15
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc4 PQ3_IIC(RD): I/O error (i2c_stat=0x16,
i2c_ctl[1]=0x80, bus_addr=0x51)
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc4 PQ3_IIC(WR): no target ack on byte 0
(wait spins 3)
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 mib2d[2060]: SNMP_TRAP_LINK_DOWN: ifIndex
694, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/3/1
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 mib2d[2060]: SNMP_TRAP_LINK_DOWN: ifIndex
629, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-2/2/1
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 eventd: sendto: Network is down
Nov 24 15:23:59 BJWJ-PC-CMNET-RT02-RE0 fpc4 PQ3_IIC(WR): I/O error (i2c_stat=0xa3,
i2c_ctl[1]=0xb0, bus_addr=0x71)
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 rmopd[2056]: RMOPD_ICMP_SENDMSG_FAILURE:
sendmsg(ICMP): No route to host
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 alarmd[1509]: Alarm set: CB color=YELLOW,
class=CHASSIS, reason=Check CB 0 Fabric Chip 0
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 craftd[1510]: Minor alarm set, Check CB 0
Fabric Chip 0
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 alarmd[1509]: Alarm set: CB color=YELLOW,
class=CHASSIS, reason=Check CB 0 Fabric Chip 1
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 alarmd[1509]: Alarm set: CB color=YELLOW,
class=CHASSIS, reason=Check CB 1 Fabric Chip 0
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 alarmd[1509]: Alarm set: CB color=YELLOW,
class=CHASSIS, reason=Check CB 1 Fabric Chip 1
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 craftd[1510]: Minor alarm set, Check CB 0
Fabric Chip 1
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 craftd[1510]: Minor alarm set, Check CB 1
Fabric Chip 0
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 craftd[1510]: Minor alarm set, Check CB 1
Fabric Chip 1
Nov 24 15:24:00 BJWJ-PC-CMNET-RT02-RE0 rmopd[2056]: RMOPD_ICMP_SENDMSG_FAILURE:
sendmsg(ICMP): No route to host
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 last message repeated 3 times
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 /kernel: ae_bundlestate_ifd_change: bundle
ae4: bundle IFD minimum links not met 0 < 1
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 /kernel: ae_bundlestate_ifd_change: bundle
ae5: bundle IFD minimum links not met 0 < 1
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 rmopd[2056]: RMOPD_ICMP_SENDMSG_FAILURE:
sendmsg(ICMP): No route to host
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 rmopd[2056]: RMOPD_ICMP_SENDMSG_FAILURE:
sendmsg(ICMP): No route to host
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 0 failed because of crc
errors
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 1 failed because of crc
errors
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 36 failed because of crc
errors
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 37 failed because of crc
errors
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 52 failed because of crc
errors
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 53 failed because of crc
errors
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]: Link failure happened for
DPC4 PFE0
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]: Link failure happened for
DPC4 PFE1
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]: Link failure happened for
DPC0 PFE0
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]: Link failure happened for
DPC0 PFE1
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]: Link failure happened for
DPC2 PFE0
Nov 24 15:24:01 BJWJ-PC-CMNET-RT02-RE0 chassisd[1508]: Link failure happened for
DPC2 PFE1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

From: Mengzhe Hu Sent: Friday, May 15, 2015 10:52 AM To: '??' Cc: support Subject:
RE: Tech SR: 2015-0429-0357,CM-PB-BJ:BJWJPCRT02:ALL CB CHECK STATES

I understand that all CBs have been in check state since 24 Nov 2014:
root@BJWJ-PC-CMNET-RT02-RE0> show chassis alarms no-forwarding

6 alarms currently active


Alarm time Class Description
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 1
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 0
2014-11-24 15:32:05 HKT Minor Check CB 0 Fabric Chip 1
2014-11-24 15:32:05 HKT Minor Check CB 0 Fabric Chip 0
2014-11-24 15:24:08 HKT Minor CB 2 Fabric Chip 1 Not Online
2014-11-24 15:24:06 HKT Minor CB 2 Fabric Chip 0 Not Online
Hello Paul,

I've gone through the actions you took and the logs from before and after the event.

->First. CB0 and CB1 were never actually taken offline and brought back online.
That is why you are still seeing the alarms on CB0 and CB1:

BJWJ-PC-CMNET-RT02-RE0> request chassis cb offline slot 0


Master CB 0 cannot be set offline

BJWJ-PC-CMNET-RT02-RE0> request chassis cb online slot 0


CB 0 appears to be online already

BJWJ-PC-CMNET-RT02-RE0> request chassis cb offline slot 1


Backup CB 1 cannot be set offline, backup RE is online

BJWJ-PC-CMNET-RT02-RE0> request chassis cb online slot 1


CB 1 appears to be online already

->Second. As only the fabric planes are affected, we can offline/online the fabric
planes instead. In this case, the Routing Engine on the CB won't be affected. And
since the MX960 has 2+1 CB redundancy, taking the fabric planes offline and online
one by one should not affect traffic.

Commands to run. Before executing each command, use "show chassis fabric plane"
to verify the plane status:
request chassis fabric plane 0 offline
request chassis fabric plane 0 online
request chassis fabric plane 1 offline
request chassis fabric plane 1 online
request chassis fabric plane 2 offline
request chassis fabric plane 2 online
request chassis fabric plane 3 offline
request chassis fabric plane 3 online
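
The sequence above, with the verification step interleaved, can be sketched as a
small Python helper that only builds the list of CLI lines to type. This is an
illustrative sketch, not a Juniper tool: the function name is hypothetical, it
does not connect to the router, and the final status check after the last plane
is an added assumption.

```python
def fabric_plane_reset_plan(planes=range(4)):
    """Build the CLI lines for a one-plane-at-a-time soft reset.

    Hypothetical helper: it only returns strings, nothing is executed.
    """
    check = "show chassis fabric plane"
    steps = []
    for plane in planes:
        # Verify plane status before each state-changing command.
        steps.append(check)
        steps.append(f"request chassis fabric plane {plane} offline")
        steps.append(check)
        steps.append(f"request chassis fabric plane {plane} online")
    # Assumption: one last check after the final plane is back online.
    steps.append(check)
    return steps


plan = fabric_plane_reset_plan()
for line in plan:
    print(line)
```

Printing the plan gives a checklist to follow during the maintenance window, so
that no plane is taken offline before the previous one has been confirmed online.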

->Third. If the soft reset doesn't help, we will have to physically reset CB0 and
CB1. To reduce traffic impact, we can RMA the CBs first and do the replacement in
one maintenance window.

Please refer to the following link on MX960 fabric plane:


https://kb.juniper.net/InfoCenter/index?page=content&id=KB23173&actp=search&viewlocale=en_US&searchid=1431700921613

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Subject:Tech SR: 2015-0429-0357, CM-PB-BJ:BJWJPCRT02:ALL CB CHECK STATES

Hello Paul,

This is Mengzhe Hu from Juniper Networks Advanced TAC US team.

I have now taken ownership of Tech SR: 2015-0429-0357, CM-PB-BJ:BJWJPCRT02:ALL CB
CHECK STATES.

I understand that all CBs have been in check state since 24 Nov 2014:
root@BJWJ-PC-CMNET-RT02-RE0> show chassis alarms no-forwarding

6 alarms currently active


Alarm time Class Description
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 1
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 0
2014-11-24 15:32:05 HKT Minor Check CB 0 Fabric Chip 1
2014-11-24 15:32:05 HKT Minor Check CB 0 Fabric Chip 0
2014-11-24 15:24:08 HKT Minor CB 2 Fabric Chip 1 Not Online
2014-11-24 15:24:06 HKT Minor CB 2 Fabric Chip 0 Not Online
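
To see at a glance which fabric chips are in "Check" and which are "Not Online",
the alarm table can be reduced to a per-chip state map. The following is a
minimal sketch that assumes the alarm lines keep the exact layout shown above;
the function and pattern names are hypothetical.

```python
import re

# Timestamp, timezone, alarm class, then free-text description.
ALARM = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+\s+(Minor|Major)\s+(.+)$"
)
FABRIC = re.compile(r"CB (\d+) Fabric Chip (\d+)")


def fabric_alarm_states(output):
    """Map (cb, chip) to 'Check', 'Not Online' or 'Other'."""
    states = {}
    for line in output.splitlines():
        alarm = ALARM.match(line.strip())
        if not alarm:
            continue  # header, blank line, or alarm count
        description = alarm.group(3)
        chip = FABRIC.search(description)
        if not chip:
            continue  # not a fabric chip alarm
        if description.startswith("Check"):
            state = "Check"
        elif description.endswith("Not Online"):
            state = "Not Online"
        else:
            state = "Other"
        states[(int(chip.group(1)), int(chip.group(2)))] = state
    return states


sample = """\
Alarm time Class Description
2014-11-24 15:32:05 HKT Minor Check CB 1 Fabric Chip 1
2014-11-24 15:24:08 HKT Minor CB 2 Fabric Chip 1 Not Online
"""
states = fabric_alarm_states(sample)
print(states)
```

In the alarm output above, the CB 0 and CB 1 chips are in "Check", while the CB 2
chips are reported as "Not Online".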

I've checked the logs, and it looks like there were transient CRC errors on all the
fabric planes at that time:
Line 40291: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
0 failed because of crc errors
Line 40293: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
1 failed because of crc errors
Line 40295: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
36 failed because of crc errors
Line 40297: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
37 failed because of crc errors
Line 40299: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
52 failed because of crc errors
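
The affected links can be pulled out of these messages with a short script. This
is a sketch under the assumption that the message text matches the lines quoted
above; the names are hypothetical, and wrapped lines are rejoined before matching.

```python
import re
from collections import defaultdict

HSL_ERROR = re.compile(
    r"CHASSISD_FASIC_HSL_LINK_ERROR: Fchip \(CB (\d+), ID (\d+)\): "
    r"link (\d+) failed because of crc errors"
)


def failed_links(log_text):
    """Group failed HSL link numbers by (cb, chip_id)."""
    # Collapse all whitespace so messages wrapped across lines match.
    flat = " ".join(log_text.split())
    links = defaultdict(list)
    for cb, chip_id, link in HSL_ERROR.findall(flat):
        links[(int(cb), int(chip_id))].append(int(link))
    return dict(links)


sample = """\
Line 40291: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
0 failed because of crc errors
Line 40293: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
1 failed because of crc errors
Line 40295: Nov 24 15:24:01 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link
36 failed because of crc errors
"""
links = failed_links(sample)
print(links)
```

In the lines quoted here, every match falls on CB 0, ID 0; running the same
grouping over the full chassisd log would show whether other Fchips were hit.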

The "CHECK" status will not clear on its own unless we reboot/reseat the CB.
As you don't have NSR enabled, is it possible to schedule a MW to reboot/reseat
these CBs one by one?

Also, "show log messages" has rolled over, so it doesn't give any information for
2014-11-24.

From "show log chassisd", it looks like there was a temperature issue across the
system just before the problem occurred. Can you recall any related event in the
data center?
Nov 24 15:24:01 FPC 0 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 0 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 0 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 0 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 0 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 0 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 2 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 2 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 2 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 2 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 2 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 2 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 4 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 4 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 4 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 4 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 4 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 4 temp sensor not ok, status 0x8 failed 1 times
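
A quick way to summarise these entries is to count the out-of-range readings per
FPC and value. The sketch below assumes the chassisd message format shown above;
the names are hypothetical.

```python
import re
from collections import Counter

TEMP_LINE = re.compile(
    r"FPC (\d+) temperature is (-?\d+) degrees C, "
    r"which is outside operating range"
)


def out_of_range_readings(log_text):
    """Count out-of-range readings per (fpc, temperature_c)."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(
        (int(fpc), int(temp)) for fpc, temp in TEMP_LINE.findall(flat)
    )


sample = """\
Nov 24 15:24:01 FPC 0 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 0 temp sensor not ok, status 0x8 failed 1 times
Nov 24 15:24:01 FPC 0 temperature is -60 degrees C, which is outside operating
range
Nov 24 15:24:01 FPC 2 temperature is -60 degrees C, which is outside operating
range
"""
readings = out_of_range_readings(sample)
print(readings)
```

In the excerpt above, FPC 0, FPC 2 and FPC 4 each log three readings of -60
degrees C within the same second, each paired with a "temp sensor not ok" message.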

Collect the following to confirm the issue on fabric:


show chassis fabric fpcs
show chassis fabric map
show chassis fabric plane
show chassis fabric plane-location
show chassis fabric reachability
show chassis fabric summary
show chassis environment
show chassis environment cb
show chassis environment fpc

Next action plan: Collect the data to double-confirm the issue, then reseat/reboot
the CBs one by one. In most situations, the "CHECK" status will go away after a
reboot. If the "CHECK" status continues, we will go ahead and create RMAs for these
CBs.
