hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage
Closed, Resolved · Public · Request

Description

  • - Provide FQDN of system.
    • cloudvirt1042.eqiad.wmnet
  • - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
    • Server is out of service and can be worked at any time.
  • - Put system into a failed state in Netbox.
  • - Provide urgency of request, along with justification (redundancy, dependencies, etc)
    • No particular urgency.
  • - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
    • Server fails to boot after a reimage.
  • - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

T364984 (cloudvirt1041: can't boot after reimage) was this exact same issue on another cloudvirt from the same batch. That one was fixed by upgrading the NIC firmware to 21.81.

I tried to upgrade the firmware myself, but the cookbook (sudo cookbook sre.hardware.upgrade-firmware -c nic --new cloudvirt1042.eqiad.wmnet) only shows an option to upgrade to 22.9, which AIUI has other issues and should be avoided for now. Is there a way I can upgrade the firmware to 21.81 myself? If not, could you please upgrade the firmware on this server? I'm happy to take care of the reimage once the firmware has been upgraded.

Event Timeline

taavi renamed this task from "hw troubleshooting: cloudvirt1042 fails to boot after a reimage" to "hw troubleshooting: cloudvirt1042, cloudvirt1043 fails to boot after a reimage". Wed, Jun 19, 12:53 PM

cloudvirt1043 seems to be having the same issue too, so this may affect the entire batch.

As suggested by volans, I tried running the firmware-upgrade cookbook on the other cumin server, which had the correct version cached. That just finished, so I'm now trying to reimage cloudvirt1042 with the 21.81 firmware release.

taavi claimed this task.
taavi added a subscriber: Jclark-ctr.

The reimages finished successfully after a firmware upgrade.