vGPUs in nova

We have a few compute-nodes with Nvidia GPUs that support GRID and vGPUs. This page explains what's needed to make that work. You basically need the correct puppet role and some hiera-magic. When that's in place, we need a host aggregate and a custom flavor to make sure that only VMs with a vGPU get scheduled onto our GPU-nodes.

Official documentation - the section about custom traits is needed if we want to have different GRID profiles on servers with multiple physical GPUs

Role

The compute node must have our puppet-role openstack::compute::ceph::vgpu

Hiera

In the node-specific hiera for the gpu-node, we need to set a key that tells nova which GRID-profile to use, and which PCI devices we want to use:

nova::compute::vgpu::vgpu_types_device_addresses_mapping:
  <type>: [ '<pci-device-address>', '<pci-device-address>' ]

# Example:
nova::compute::vgpu::vgpu_types_device_addresses_mapping:
  nvidia-183: [ '0000:3b:00.0', '0000:d8:00.0' ]

The type can be discovered from sysfs:

1. Find the name of the GRID profile you need: https://docs.nvidia.com/grid/11.0/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
# nvidia-smi vgpu -s (list all supported vGPU types)
# nvidia-smi vgpu -c (list all creatable vGPU types)

2. Find the PCI-device address(es) for the GPU(s): 
# lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

3. Go to the folder for one of them in sysfs
# cd /sys/class/mdev_bus/0000\:3b\:00.0/mdev_supported_types/

4. Find the type with your selected name
# grep -l "V100D-8Q" nvidia-*/name
nvidia-183/name

5. Now you know which type to set in the hiera-key
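
Steps 2-4 can be combined into one small helper. This is only a sketch: it assumes the sysfs layout shown above, and the MDEV_BUS_DIR variable is a made-up override (not something the kernel or nova knows about) that lets you point the function at another directory for testing.

```shell
# Sketch: print "<pci-address> <type>" for every GPU that offers the given
# GRID profile. MDEV_BUS_DIR is a hypothetical override for testing; by
# default the function reads the sysfs path used in step 3.
find_vgpu_type() {
    local profile="$1"
    local base="${MDEV_BUS_DIR:-/sys/class/mdev_bus}"
    local name_file type_dir
    for name_file in "$base"/*/mdev_supported_types/*/name; do
        [ -e "$name_file" ] || continue   # the glob matched nothing
        # Match the profile name as a fixed, whole word (same idea as step 4).
        if grep -qwF -- "$profile" "$name_file"; then
            # Path layout: <base>/<pci-address>/mdev_supported_types/<type>/name
            type_dir="$(dirname "$name_file")"
            printf '%s %s\n' \
                "$(basename "$(dirname "$(dirname "$type_dir")")")" \
                "$(basename "$type_dir")"
        fi
    done
}
```

Calling find_vgpu_type "V100D-8Q" on the GPU node should list one line per physical GPU, which gives you both the type and the PCI addresses for the hiera-key.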

Host aggregate

Create a host aggregate with name gpu-<gpu-model>-<gpu-memory>.
- For our V100, this will be gpu-v100-8g
Add the custom metadata: node_type = <same name as the host aggregate>
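
With the openstack CLI, this could look roughly like the sketch below. The function name and the hostname argument are made up for illustration, and adding the GPU node to the aggregate is an assumption on our part (the aggregate only has an effect on hosts that are members of it).

```shell
# Sketch: create the host aggregate, tag it with node_type and add one host.
# Usage: create_gpu_aggregate <aggregate-name> <compute-hostname>
create_gpu_aggregate() {
    local aggregate="$1" host="$2"
    openstack aggregate create "$aggregate" &&
        # node_type is set to the same value as the aggregate name
        openstack aggregate set --property "node_type=$aggregate" "$aggregate" &&
        # assumption: the GPU node is added as a member of the aggregate
        openstack aggregate add host "$aggregate" "$host"
}
```

For our V100 that would be create_gpu_aggregate gpu-v100-8g <hostname of the GPU node>.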

Flavor

Finally, we need a flavor for each vGPU type that asks for a node in the correct host aggregate.

Create a flavor with the name gpu.<gpu-model>.<gpu-memory>
- For our V100, this will be gpu.v100.8G
Add the custom metadata: aggregate_instance_extra_specs:node_type = gpu-<gpu-model>-<gpu-memory>
- For our V100, the metadata is: aggregate_instance_extra_specs:node_type = gpu-v100-8g
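
A minimal sketch of the flavor creation with the openstack CLI follows. The function name and the vCPU/RAM/disk arguments are placeholders you have to choose yourself. The resources:VGPU=1 property is not described on this page; it is how the upstream nova vGPU documentation requests a vGPU, so treat it as an assumption about our setup and check an existing GPU flavor before relying on it.

```shell
# Sketch: create a GPU flavor that targets the matching host aggregate.
# Usage: create_gpu_flavor <flavor-name> <aggregate-name> <vcpus> <ram-mb> <disk-gb>
create_gpu_flavor() {
    local flavor="$1" aggregate="$2" vcpus="$3" ram_mb="$4" disk_gb="$5"
    # aggregate_instance_extra_specs:node_type must match the aggregate metadata.
    # resources:VGPU=1 is an assumption taken from the upstream nova docs.
    openstack flavor create \
        --vcpus "$vcpus" --ram "$ram_mb" --disk "$disk_gb" \
        --property "aggregate_instance_extra_specs:node_type=$aggregate" \
        --property "resources:VGPU=1" \
        "$flavor"
}
```

For our V100 the first two arguments would be gpu.v100.8G and gpu-v100-8g.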

Rebooting/downtime on the GPU node

The correct procedure for shutting down a GPU node is to first shut down the instances and then shelve them before rebooting the node. This is because the GRID drivers on the node aren't created on boot, and nova-compute will crash when it tries to start instances while the drivers are unavailable. When the instances are unshelved, the drivers will be created.
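
The procedure can be scripted along these lines. This is a sketch, not a tested tool: the function names and the POLL_INTERVAL variable are made up, admin credentials are assumed, and the polling exists only because "openstack server stop" returns before the instance has actually reached SHUTOFF.

```shell
# Wait until a server reaches the wanted status (e.g. SHUTOFF).
# POLL_INTERVAL (seconds) and the number of tries are hypothetical knobs.
wait_for_status() {
    local id="$1" wanted="$2" tries="${3:-60}"
    while [ "$tries" -gt 0 ]; do
        [ "$(openstack server show "$id" -f value -c status)" = "$wanted" ] && return 0
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    echo "timed out waiting for $id to reach $wanted" >&2
    return 1
}

# Before the reboot: stop, then shelve, every instance ID given as argument.
shelve_instances() {
    local id
    for id in "$@"; do
        openstack server stop "$id"
        # Do not shelve an instance that never reached SHUTOFF.
        wait_for_status "$id" SHUTOFF || continue
        openstack server shelve "$id"
    done
}

# After the reboot: unshelve the same instance IDs.
unshelve_instances() {
    local id
    for id in "$@"; do
        openstack server unshelve "$id"
    done
}
```

Save the list of instance IDs before shelving, for example with "openstack server list --all-projects --host <hostname> -f value -c ID", since a shelved instance may no longer be listed on the host afterwards.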

Fixing GPU node after unscheduled downtime

Stop the following services

puppet.service
openstack-nova-compute
neutron-openvswitch-agent
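
A small sketch for stopping them in one go. The function name is made up. Puppet is stopped first on the assumption that a puppet run would otherwise start the other services again while we are working.

```shell
# Sketch: stop the services in the order listed above and report failures.
stop_gpu_node_services() {
    local svc rc=0
    for svc in puppet.service openstack-nova-compute neutron-openvswitch-agent; do
        if ! systemctl stop "$svc"; then
            echo "failed to stop $svc" >&2
            rc=1
        fi
    done
    return "$rc"
}
```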

Create the devices

# Find the devices
for a in $(virsh list --all | cut -b7- | grep instance | cut -d' ' -f1); do echo $a; virsh dumpxml $a |grep mdev -A2; done
instance-0000ef36
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='d4965599-e288-45ff-816a-c9f1ba6dfe77'/>
instance-0000ef39
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='558587e6-18a6-4b78-a6b2-01372e07e507'/>
instance-0000ef3c
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='a0085b1c-f13a-48ee-acc5-fb497d33db78'/>

# Creating the devices
echo "<uuid>" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/nvidia-471/create
echo "<uuid>" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn1/mdev_supported_types/nvidia-471/create
echo "<uuid>" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn2/mdev_supported_types/nvidia-471/create
echo "<uuid>" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn3/mdev_supported_types/nvidia-471/create

# Example
echo "d4965599-e288-45ff-816a-c9f1ba6dfe77" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn0/mdev_supported_types/nvidia-471/create
echo "558587e6-18a6-4b78-a6b2-01372e07e507" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn1/mdev_supported_types/nvidia-471/create
echo "a0085b1c-f13a-48ee-acc5-fb497d33db78" > /sys/bus/pci/devices/0000\:86\:00.0/virtfn2/mdev_supported_types/nvidia-471/create
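
The two steps above can be wrapped in helpers like the ones below. This is a sketch under a few assumptions: the function names and the MDEV_DEVICES_DIR override are made up, existing mdevs are assumed to show up under /sys/bus/mdev/devices/<uuid>, and you still have to decide yourself which parent device (virtfn) each UUID goes on.

```shell
# Print the mdev UUIDs found in a libvirt domain XML read from stdin,
# e.g.: virsh dumpxml instance-0000ef36 | mdev_uuids_from_xml
mdev_uuids_from_xml() {
    sed -n "s/.*<address uuid='\([0-9a-fA-F-]*\)'.*/\1/p"
}

# Create one mdev with the given UUID on a parent device, unless it exists.
#   parent: sysfs directory of the GPU or virtual function,
#           e.g. /sys/bus/pci/devices/0000:86:00.0/virtfn0
#   type:   vGPU type, e.g. nvidia-471
# MDEV_DEVICES_DIR is a hypothetical override used for testing.
create_mdev() {
    local uuid="$1" parent="$2" type="$3"
    local devices="${MDEV_DEVICES_DIR:-/sys/bus/mdev/devices}"
    local create="$parent/mdev_supported_types/$type/create"
    if [ -e "$devices/$uuid" ]; then
        echo "mdev $uuid already exists, skipping"
        return 0
    fi
    if [ ! -e "$create" ]; then
        echo "no create file for $type under $parent" >&2
        return 1
    fi
    echo "$uuid" > "$create"
}
```

The first example line above would then be create_mdev d4965599-e288-45ff-816a-c9f1ba6dfe77 /sys/bus/pci/devices/0000:86:00.0/virtfn0 nvidia-471.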

Start a puppet run; it will start the services that were stopped previously. You might have to enable puppet first.

puppet agent --enable # if needed
puppet agent -t

Since we've created the drivers without ensuring that they are the correct ones, the instances will start fine now, but if one of them is destroyed, the wrong driver might be freed. The instances therefore need to be stopped, shelved, unshelved and started. After that, everything should be correct.
