  • Create a flavor with the name gpu.<gpu-model>.<gpu-memory>
    • For our V100, this will be gpu.v100.8G
  • Add the custom metadata: aggregate_instance_extra_specs:node_type = gpu-<gpu-model>-<gpu-memory>
    • For our V100, the metadata is: aggregate_instance_extra_specs:node_type = gpu-v100-8g
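
The naming convention above can be sketched as a small shell helper. This is only an illustration: the function names `gpu_flavor_name` and `gpu_node_type` are hypothetical, the lower-casing of the metadata value is inferred from the V100 example, and the sizing options for the flavor (vCPUs, RAM, disk) are left out. The commands are printed rather than executed.

```shell
# Sketch: derive the flavor name and the aggregate metadata value
# from the GPU model and memory size (hypothetical helper names).
gpu_flavor_name() {
    printf 'gpu.%s.%s\n' "$1" "$2"
}

gpu_node_type() {
    # Assumption: the metadata value uses dashes and lower case (gpu-v100-8g).
    printf 'gpu-%s-%s\n' "$1" "$2" | tr '[:upper:]' '[:lower:]'
}

model=v100
memory=8G
flavor=$(gpu_flavor_name "$model" "$memory")
node_type=$(gpu_node_type "$model" "$memory")

# Print the commands instead of running them; add --vcpus, --ram and --disk as needed.
echo "openstack flavor create $flavor"
echo "openstack flavor set --property aggregate_instance_extra_specs:node_type=$node_type $flavor"
```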

Rebooting/downtime on the GPU node

The correct procedure for shutting down a GPU node is to first shut down the instances and then shelve them before rebooting the node. This is because the grid drivers on the node are not created on boot, and nova compute will crash when it tries to start instances while the drivers are unavailable. Unshelving the instances afterwards creates the drivers.
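
A minimal sketch of that procedure, assuming the instance names or IDs on the node are already known. The helper names `run`, `stop_and_shelve` and `unshelve_all` are hypothetical, and `DRY_RUN` defaults to printing the commands instead of executing them. Note that `openstack server stop` returns before the instance has actually powered off, so a real script would probably need to wait for the SHUTOFF state before shelving.

```shell
# Sketch: stop and shelve instances before a reboot, unshelve them afterwards.
# With DRY_RUN=1 (the default here) the commands are only printed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Before rebooting the node: stop, then shelve, each instance.
stop_and_shelve() {
    for server in "$@"; do
        run openstack server stop "$server"
        # Assumption: the instance has reached SHUTOFF before this runs.
        run openstack server shelve "$server"
    done
}

# After the node is back up: unshelve each instance.
unshelve_all() {
    for server in "$@"; do
        run openstack server unshelve "$server"
    done
}
```

For example, `stop_and_shelve vm1 vm2` prints the four commands in order, and `DRY_RUN=0 stop_and_shelve vm1 vm2` would run them.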

Fixing GPU node after unscheduled downtime

Stop the following services

  • puppet.service
  • openstack-nova-compute
  • neutron-openvswitch-agent

Create the devices

Code Block
# Find the mdev device UUID of every instance on the node
# (--name prints only the domain names, so no column cutting is needed)
for a in $(virsh list --all --name); do echo "$a"; virsh dumpxml "$a" | grep -A2 mdev; done
# Example output:
instance-0000ef36
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='d4965599-e288-45ff-816a-c9f1ba6dfe77'/>
instance-0000ef39
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='558587e6-18a6-4b78-a6b2-01372e07e507'/>
instance-0000ef3c
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='a0085b1c-f13a-48ee-acc5-fb497d33db78'/>
# Create the devices: replace "uuid" with each address uuid found above, one per virtual function
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/nvidia-471/create
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn1/mdev_supported_types/nvidia-471/create
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn2/mdev_supported_types/nvidia-471/create
echo "uuid" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn3/mdev_supported_types/nvidia-471/create

# Example, using the three UUIDs found above
echo "d4965599-e288-45ff-816a-c9f1ba6dfe77" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/nvidia-471/create
echo "558587e6-18a6-4b78-a6b2-01372e07e507" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn1/mdev_supported_types/nvidia-471/create
echo "a0085b1c-f13a-48ee-acc5-fb497d33db78" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn2/mdev_supported_types/nvidia-471/create
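
The two manual steps above (finding the UUIDs and writing them to the `create` files) could be combined roughly as follows. This is a sketch under assumptions, not a tested procedure: the function names `extract_mdev_uuids` and `print_create_commands` are hypothetical, the PCI address and mdev type are simply the ones from the example above, and assigning UUIDs to `virtfn0`, `virtfn1`, ... in order mirrors the example rather than any guaranteed mapping. The script only prints the `echo` commands so they can be reviewed before being run.

```shell
# Sketch: turn mdev UUIDs from libvirt domain XML into "create" commands.
# PCI address and mdev type are taken from the example above; adjust as needed.
PCI_DEV="${PCI_DEV:-0000:86:00.0}"
MDEV_TYPE="${MDEV_TYPE:-nvidia-471}"

# Print every mdev UUID found in domain XML read from stdin.
extract_mdev_uuids() {
    grep -A2 "type='mdev'" | sed -n "s/.*<address uuid='\([^']*\)'.*/\1/p"
}

# Print one create command per UUID argument, using virtfn0, virtfn1, ...
# The body runs in a subshell so the counter does not leak into the caller.
print_create_commands() (
    n=0
    for uuid in "$@"; do
        printf 'echo "%s" > /sys/bus/pci/devices/%s/virtfn%d/mdev_supported_types/%s/create\n' \
            "$uuid" "$PCI_DEV" "$n" "$MDEV_TYPE"
        n=$((n + 1))
    done
)

# Usage on the GPU node (not run here):
#   uuids=$(for d in $(virsh list --all --name); do virsh dumpxml "$d" | extract_mdev_uuids; done)
#   print_create_commands $uuids
```

Printing the commands instead of writing to sysfs directly keeps the step reviewable, which matters here because a wrong UUID-to-function mapping is hard to spot afterwards.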

Start a Puppet run; it will start the services that were stopped previously. You might have to enable Puppet first.

  • puppet agent --enable # if needed
  • puppet agent -t

Because we created the drivers by hand without ensuring that each instance got the correct one, the instances will start fine now, but destroying one might free the wrong driver. The instances therefore need to be stopped, shelved, unshelved and started again. After that, the assignment should be correct.