
We have a few compute nodes with Nvidia GPUs that support GRID and vGPUs. This page explains what is needed to make that work: basically the correct puppet role and some hiera magic. When that is in place, we need a host aggregate and a custom flavor to make sure that only VMs with a vGPU get scheduled onto our GPU nodes.


Official documentation - the section about custom traits is relevant if we want different GRID profiles on servers with multiple physical GPUs.
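
As a rough sketch of what that section describes: a custom trait is created in placement, tied to the resource provider of a specific GPU, and then required by the matching flavor. The trait name, UUID and flavor below are placeholders:

# openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_V100D_8Q
# openstack --os-placement-api-version 1.6 resource provider trait set --trait CUSTOM_NVIDIA_V100D_8Q <resource-provider-uuid>
# openstack flavor set --property trait:CUSTOM_NVIDIA_V100D_8Q=required <flavor>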

Role

The compute node must have our puppet role openstack::compute::ceph::vgpu.
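
How the role gets applied depends on the node classification; as a minimal sketch, assuming roles are mapped through a role key in the node-specific hiera (the key name here is hypothetical, adapt to however roles are assigned in our setup):

# node-specific hiera, e.g. <node>.yaml - hypothetical role mapping
role: openstack::compute::ceph::vgpu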

Hiera

In the node-specific hiera for the GPU node, we need to set a key that tells nova which GRID profile to use and which PCI devices to use:

nova::compute::vgpu::vgpu_types_device_addresses_mapping:
  <type>: [ '<pci-device-address>', '<pci-device-address>' ]

# Example:
nova::compute::vgpu::vgpu_types_device_addresses_mapping:
  nvidia-183: [ '0000:3b:00.0', '0000:d8:00.0' ]

The type can be discovered from sysfs.

1. Find the name of the GRID profile you need: https://docs.nvidia.com/grid/11.0/grid-vgpu-user-guide/index.html#virtual-gpu-types-grid-reference
2. Find the PCI-device address(es) for the GPU(s): 

# lspci | grep NVIDIA
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
d8:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

3. Go to the directory for one of them in sysfs:
# cd /sys/class/mdev_bus/0000\:3b\:00.0/mdev_supported_types/

4. Find the type with your selected name:
# grep -l "V100D-8Q" nvidia-*/name
nvidia-183/name

5. The matching directory name (here nvidia-183) is the type to set in the hiera key.
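
To sanity-check the choice, the type's standard mdev sysfs attributes can be inspected directly; the output below is illustrative for a 32 GB V100 with the 8Q profile:

# cat nvidia-183/name
GRID V100D-8Q
# cat nvidia-183/available_instances
4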


Host aggregate

  • Create a host aggregate with name gpu-<gpu-model>-<gpu-memory>.
    • For our V100, this will be gpu-v100-8g
  • Add the custom metadata: node_type = <same name as the host aggregate>
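
A minimal sketch of the same steps with the openstack CLI (the compute host name is a placeholder):

# openstack aggregate create gpu-v100-8g
# openstack aggregate set --property node_type=gpu-v100-8g gpu-v100-8g
# openstack aggregate add host gpu-v100-8g <gpu-compute-host>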

Flavor

Finally, we need a flavor for each vGPU type that requests a node in the correct host aggregate:

  • Create a flavor with the name gpu.<gpu-model>.<gpu-memory>
    • For our V100, this will be gpu.v100.8G
  • Add the custom metadata: aggregate_instance_extra_specs:node_type = gpu-<gpu-model>-<gpu-memory>
    • For our V100, the metadata is: aggregate_instance_extra_specs:node_type = gpu-v100-8g
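
A minimal CLI sketch (the vcpus/ram/disk values are placeholders); note that, per the upstream Nova vGPU documentation, the flavor must also request the vGPU resource itself with resources:VGPU=1:

# openstack flavor create --vcpus 8 --ram 32768 --disk 80 gpu.v100.8G
# openstack flavor set gpu.v100.8G --property resources:VGPU=1
# openstack flavor set gpu.v100.8G --property aggregate_instance_extra_specs:node_type=gpu-v100-8g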