
We have a few compute nodes with Nvidia GPUs that support GRID and vGPUs. This page explains what is needed to make that work. You basically need the correct Puppet role and some hiera magic. When that is in place, we need a host aggregate and a custom flavor to make sure that only VMs with a vGPU get scheduled onto our GPU nodes.

...

Code Block
1. Find the name of the GRID profile you need: https://docs.nvidia.com/grid/11.0/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
# nvidia-smi vgpu -s (list all supported vGPU types)
# nvidia-smi vgpu -c (list all creatable vGPU types)

2. Find the PCI-device address(es) for the GPU(s): 

# lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

3. Go to the folder for one of them in sysfs
# cd /sys/class/mdev_bus/0000\:3b\:00.0/mdev_supported_types/

4. Find the type with your selected name
# grep -l "V100D-8Q" nvidia-*/name
nvidia-183/name

5. Now you know which type to set in the hiera key (nova::compute::vgpu::vgpu_types_device_addresses_mapping)

...

  • Create a host aggregate with name gpu-<gpu-model>-<gpu-memory>.
    • For our V100, this will be gpu-v100-8g
  • Add the custom metadata: node_type = <same name as the host aggregate> (see the example commands below)
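For reference, a minimal sketch of the CLI commands (the hostname is a placeholder, and the values assume the V100 example above):

Code Block
# Create the aggregate and tag it with node_type
openstack aggregate create gpu-v100-8g
openstack aggregate set --property node_type=gpu-v100-8g gpu-v100-8g

# Add the GPU compute node(s) to the aggregate
openstack aggregate add host gpu-v100-8g <compute-node-hostname>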

Trait

To support multiple vGPU types, the GPU resource providers need to be tagged with a custom trait that indicates which vGPU type they provide. We do this on all servers for consistency, even on those that are not supposed to support multiple types.

Code Block
export OS_PLACEMENT_API_VERSION=1.6

# Create a new trait
openstack trait create CUSTOM_<GPU-MODEL>_<NN>G
# example name: CUSTOM_A100_20G

# Add the trait to a corresponding resource provider
openstack resource provider trait set --trait CUSTOM_A100_20G <resource provider uuid>

# To get the uuid for the above command, look in
openstack resource provider list
# And find the resource provider for a given PCI device, they're typically named something like this: gpu02.infra.skyhigh.iik.ntnu.no_pci_0000_e2_00_4


Flavor

Finally, we need a flavor for each vGPU type that asks for a node in the correct host aggregate and with the correct trait (see the example commands after the list):

  • Create a flavor with the name gpu.<gpu-model>.<gpu-memory>
    • For our V100, this will be gpu.v100.8G
  • Add the custom metadata: resources:VGPU=1
  • Add the custom metadata: aggregate_instance_extra_specs:node_type = gpu-<gpu-model>-<gpu-memory>
    • For our V100, the metadata is: aggregate_instance_extra_specs:node_type = gpu-v100-8g
  • Add the custom metadata: trait:TRAITNAME=required
    • For example: trait:CUSTOM_A100_20G=required
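
As a sketch, the same flavor can be created from the CLI. The vCPU/RAM/disk values are placeholders, and the trait name must match the trait you created earlier (here assuming CUSTOM_V100_8G for the V100 example):

Code Block
# Create the flavor (adjust vcpus/ram/disk to whatever the flavor should provide)
openstack flavor create --vcpus 4 --ram 16384 --disk 40 gpu.v100.8G

# Set the metadata described above
openstack flavor set gpu.v100.8G \
  --property resources:VGPU=1 \
  --property aggregate_instance_extra_specs:node_type=gpu-v100-8g \
  --property trait:CUSTOM_V100_8G=required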

Rebooting/downtime on the GPU node

The correct procedure for shutting down a GPU node is to first shut down the instances and then shelve them, before rebooting the node. This is because the mdev devices created by the GRID driver are not recreated on boot, and nova-compute will crash when it tries to start instances whose devices are unavailable. When the instances are unshelved, the devices are created again.
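
A minimal sketch of the procedure with the OpenStack CLI (the instance UUID is a placeholder; repeat for every instance on the node):

Code Block
# Before rebooting the node
openstack server stop <instance uuid>
openstack server shelve <instance uuid>

# After the node is back up
openstack server unshelve <instance uuid>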

Fixing GPU node after unscheduled downtime

Info: This procedure is automated!

This procedure can now be done with the help of our script "fix-gpu-mdevs.sh" in our tools repo.

This is "disaster recovery" and should only be necessary in the event of a powerloss, or in the event of a sysadmin that forgot the proper way to reboot a GPU node (which is to shelve all VMs before reboot).

tl;dr: When the node reboots, the mdev device files for all vGPUs disappear, and nova-compute will fail to start the VMs because these devices are gone. Thus, we have to recreate the mdev devices manually, with the same UUIDs they had before the reboot.

Stop the following services

  • puppet.service
  • openstack-nova-compute
  • neutron-openvswitch-agent
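
For example, on a systemd-based node (assuming the unit names match the list above):

Code Block
systemctl stop puppet.service openstack-nova-compute.service neutron-openvswitch-agent.service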

Create the devices

Code Block
# Find out which GRID profile the node is using. This can be found in the node-specific hiera-file, in the key nova::compute::vgpu::vgpu_types_device_addresses_mapping

# Find the devices
for a in $(virsh list --all --name | grep instance); do echo $a; virsh dumpxml $a | grep -A2 mdev; done
instance-0000ef36
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='d4965599-e288-45ff-816a-c9f1ba6dfe77'/>
instance-0000ef39
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='558587e6-18a6-4b78-a6b2-01372e07e507'/>
instance-0000ef3c
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='a0085b1c-f13a-48ee-acc5-fb497d33db78'/>

# Creating the devices
# On GPUs using SR-IOV for VGPUs (the A100):
# SR-IOV cards have one VGPU per VF. Do this for each device in /sys/class/mdev_bus/
echo "<uuid>" > /sys/class/mdev_bus/<pci-address>/mdev_supported_types/<grid-profile>/create

# Example
echo "d4965599-e288-45ff-816a-c9f1ba6dfe77" > /sys/class/mdev_bus/0000\:86\:00.4/mdev_supported_types/nvidia-471/create

# On GPUs not using SR-IOV there are no VFs, and we must create multiple mdevs on the same PCI device.
# In that case, run the above example N times per device in /sys/class/mdev_bus, where N is the number of VGPUs you need per physical GPU.
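
# (Sketch) Verify that the devices exist again; the UUIDs found above should show up here
ls -l /sys/bus/mdev/devices/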

Start a Puppet run; it will start the services that were stopped previously. Puppet might have to be enabled first:

  • puppet agent --enable # if needed
  • puppet agent -t

The manually re-created mdev devices may now be mapped to a different VF than before the reboot. This is fine when you start the existing instances, but it will cause nova/placement to free the wrong device when a VM is deleted, which in turn leads to a NoValidHost error if you try to create a new VM with a vGPU. Workaround: the instances need to be stopped, shelved, unshelved and started again. After that the mapping should be correct.

This is mainly a problem on the SR-IOV-enabled GPUs, but it can also happen on GPU nodes with multiple non-SR-IOV GPUs.