...

Fixing GPU node after unscheduled downtime

This "disaster recovery" and should only be necessary in the event of a powerloss, or in the event of a sysadmin that forgot the proper way to reboot a GPU node (which is to shelve all VMs before reboot).

tl;dr: When the node reboots, the mdev device files for all VGPUs disappear, and nova-compute will fail to start the VMs because these devices are gone. Thus, we have to recreate the mdev devices manually, with the same UUIDs they had before the reboot.
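A quick way to confirm this state (a minimal check, assuming the standard mdev sysfs layout): list the registered mdev devices. After an unclean reboot this directory is typically empty, even though the instances still reference their old UUIDs.

Code Block
ls /sys/bus/mdev/devices/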

Stop the following services

...

Create the devices

Code Block
# Find out which GRID profile the node is using. This can be found in the node-specific hiera-file, in the key nova::compute::vgpu::vgpu_types_device_addresses_mapping
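
# The profile and its human-readable name can also be double-checked directly in sysfs
# (a sanity check, assuming the SR-IOV layout used further down):
ls /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/
cat /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/nvidia-471/name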

# Find the devices
for a in $(virsh list --all | cut -b7- | grep instance | cut -d' ' -f1); do echo $a; virsh dumpxml $a |grep mdev -A2; done
instance-0000ef36
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='d4965599-e288-45ff-816a-c9f1ba6dfe77'/>
instance-0000ef39
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='558587e6-18a6-4b78-a6b2-01372e07e507'/>
instance-0000ef3c
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='a0085b1c-f13a-48ee-acc5-fb497d33db78'/>

# Creating the devices

# On GPUs using SR-IOV for VGPUs (the A100):
# SR-IOV cards have one VGPU per VF. Do this for all devices in /sys/class/mdev_bus/,
# using the GRID profile (nvidia-<profile>) found above.
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/nvidia-471/create
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn1/mdev_supported_types/nvidia-471/create
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn2/mdev_supported_types/nvidia-471/create
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn3/mdev_supported_types/nvidia-471/create

# Example
echo "d4965599-e288-45ff-816a-c9f1ba6dfe77" > /sys/bus/pci/devicesclass/mdev_bus/0000\:86\:00.0/virtfn04/mdev_supported_types/nvidia-471/create
echo "558587e6-18a6-4b78-a6b2-01372e07e507" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn1/mdev_supported_types/nvidia-471/create
echo "a0085b1c-f13a-48ee-acc5-fb497d33db78" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn2/mdev_supported_types/nvidia-471/create

# On GPUs not using SR-IOV there are no VFs, and we must create multiple mdevs on the same PCI device.
# Basically, run the above example N times per device in /sys/class/mdev_bus, where N is how many VGPUs you need per physical GPU.
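
# Verify (assuming the standard mdev sysfs layout): every re-created VGPU should now
# show up under /sys/bus/mdev/devices/ as a symlink named by its UUID,
# pointing at the parent PCI device/VF.
ls -l /sys/bus/mdev/devices/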

Start a puppet run; it will start the services that were stopped previously. Puppet might have to be enabled first.

  • puppet agent --enable # if needed
  • puppet agent -t

The manually re-created mdev devices may now be mapped to a different VF than before the reboot. This is OK when you start the existing instances, but it will cause nova/placement to free the wrong device when a VM is deleted, which in turn leads to a NoValidHost error if you try to create a new VM with a VGPU. Workaround: the instances need to be stopped, shelved, unshelved and started. After that, the mapping should be correct.
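A minimal sketch of that workaround per affected instance, using the standard OpenStack CLI (<server> is a placeholder for the instance name or UUID, and admin credentials are assumed to be sourced):

Code Block
# Shelving offloads the instance from the host; unshelving re-schedules it,
# so nova/placement allocates a fresh (and correctly mapped) VGPU device.
openstack server stop <server>
openstack server shelve <server>
openstack server unshelve <server>
# Only needed if the instance does not come back ACTIVE after unshelve
openstack server start <server>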

This is mainly a problem on SR-IOV-enabled GPUs, but it can also happen on GPU nodes with multiple non-SR-IOV GPUs.