Some PCIe resources, such as GPUs and InfiniBand network cards, are most useful when handed directly to VMs. This article describes the steps needed to configure PCI passthrough so that PCI devices are attached to certain flavors in the OpenStack cloud.

Enable IOMMU on the compute-node

First, VT-d needs to be enabled in the system's BIOS/UEFI menu. This option might be visible on its own, or hidden behind a generic "Enable Virtualization Technologies" setting. Next, enable the IOMMU in Ubuntu by modifying /etc/default/grub to contain:

Enable IOMMU

GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
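If /etc/default/grub already carries other kernel parameters (for example "quiet splash"), the IOMMU options are probably better appended than written over the whole line. A minimal Python sketch of such a merge follows. The helper name add_kernel_params and the sample line are illustrative assumptions, not part of any existing tooling.

```python
IOMMU_PARAMS = ["intel_iommu=on", "iommu=pt"]


def add_kernel_params(line, params=IOMMU_PARAMS):
    """Return a GRUB_CMDLINE_LINUX* assignment with the given params appended.

    Parameters that are already present are not duplicated. Any other line
    (comments, unrelated settings) is returned unchanged.
    """
    key, sep, value = line.partition("=")
    if not sep or not key.strip().startswith("GRUB_CMDLINE_LINUX"):
        return line
    # Drop surrounding whitespace and quotes, then split into parameters.
    current = value.strip().strip("\"'").split()
    merged = current + [p for p in params if p not in current]
    return '{}="{}"'.format(key.strip(), " ".join(merged))


if __name__ == "__main__":
    # Hypothetical existing line, used only for illustration.
    print(add_kernel_params('GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'))
```

The function only transforms a single line of text; reading and writing the actual file is left out on purpose, so the result can be reviewed before it is applied.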

Next, regenerate the initramfs and the GRUB configuration:

Regenerate initramfs and GRUB configuration

# update-initramfs -u
# grub-mkconfig -o /boot/grub/grub.cfg

After a reboot, the kernel log should show that the IOMMU is enabled:

Verify that IOMMU is enabled

# dmesg | grep 'IOMMU enabled'
[    0.632907] DMAR: IOMMU enabled
[    0.632954] DMAR: IOMMU enabled
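The kernel log rotates, so a check that does not depend on dmesg can be handy. The sketch below looks at the running kernel's command line and at the IOMMU groups exposed under /sys/kernel/iommu_groups. The function names are illustrative assumptions; the two paths are the standard Linux locations.

```python
import os


def missing_iommu_params(cmdline, required=("intel_iommu=on", "iommu=pt")):
    """Return the required kernel parameters absent from a command line."""
    present = cmdline.split()
    return [p for p in required if p not in present]


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the PCI addresses it contains."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or not exposed by this kernel
    for group in os.listdir(root):
        devices = os.path.join(root, group, "devices")
        if os.path.isdir(devices):
            groups[group] = sorted(os.listdir(devices))
    return groups


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as handle:
            missing = missing_iommu_params(handle.read())
        print("missing kernel parameters:", missing or "none")
    for group, devices in sorted(iommu_groups().items()):
        print("group", group, " ".join(devices))
```

An empty group listing on a rebooted node usually means that either the firmware setting or the kernel parameters did not take effect.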

Configure OpenStack to know about the PCIe devices

The compute nodes need to know which PCI devices to pass through to the VMs, and for simplicity's sake it is convenient to use aliases instead of PCI vendor/device IDs. So first we create an alias by adding a key to the global hiera:

Hieradata for PCI-device alias

nova::pci::aliases:
 - name: 'p100'
   vendor_id: '10de'
   product_id: '15f8'
   device_type: 'type-PCI'
   numa_policy: 'preferred'
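The vendor_id and product_id values can be read on the compute node from "lspci -nn", which prints them as a "[vendor:product]" pair at the end of each device line. The sketch below extracts that pair from one line. The function name and the sample line are assumptions made for illustration; the sample was constructed from the IDs used in this article, not captured from a real host.

```python
import re

# Matches the "[vendor:product]" pair printed by `lspci -nn`.
ID_PATTERN = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(lspci_nn_line):
    """Return (vendor_id, product_id) from one `lspci -nn` line, or None."""
    match = ID_PATTERN.search(lspci_nn_line.lower())
    return match.groups() if match else None


if __name__ == "__main__":
    # Illustrative sample line, not real output from a compute node.
    sample = ("3b:00.0 3D controller [0302]: NVIDIA Corporation "
              "GP100GL [Tesla P100 PCIe 16GB] [10de:15f8] (rev a1)")
    print(pci_ids(sample))
```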

Next, list the devices to pass through in the node-specific hiera file for the GPU node:

Hieradata for GPU-node

ntnuopenstack::nova::compute::providers:
 - name: "%{::fqdn}"
   traits: [ 'CUSTOM_COMPUTE_GPU' ]
nova::compute::pci::passthrough:
  - vendor_id: '10de'
    product_id: '15f8'
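Since the alias and the passthrough list live in different hiera files, a typo in either set of IDs leaves an alias that no device can satisfy. The sketch below is a simple consistency check under the assumption that devices are matched on vendor_id and product_id only, as in the hieradata above; other ways of selecting devices are not covered. The function name is hypothetical.

```python
def unmatched_aliases(aliases, passthrough):
    """Return names of aliases that no passthrough entry can satisfy."""
    allowed = {(d["vendor_id"].lower(), d["product_id"].lower())
               for d in passthrough}
    return [a["name"] for a in aliases
            if (a["vendor_id"].lower(), a["product_id"].lower()) not in allowed]


if __name__ == "__main__":
    # Values copied from the two hieradata snippets in this article.
    aliases = [{"name": "p100", "vendor_id": "10de", "product_id": "15f8",
                "device_type": "type-PCI", "numa_policy": "preferred"}]
    passthrough = [{"vendor_id": "10de", "product_id": "15f8"}]
    print(unmatched_aliases(aliases, passthrough))
```

An empty list means every alias has at least one matching passthrough entry.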

Configure host-aggregates to aid in the scheduling

OpenStack itself needs to know how to schedule to a certain machine, and how to avoid scheduling to the wrong machine. To help with this we create host aggregates with the 'node_type' property set to a value that is also reflected in the VM flavors, so that the scheduler only places a given flavor on the set of hosts defined in the matching host aggregate. For the pass-through of the P100 cards we create a host aggregate that looks like this:

GPU Host aggregate

$ openstack aggregate show gpu-p100
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| availability_zone | None                                 |
| created_at        | 2024-01-24T09:29:54.000000           |
| deleted_at        | None                                 |
| hosts             | gpu-b08-01-34                        |
| id                | 6                                    |
| is_deleted        | False                                |
| name              | gpu-p100                             |
| properties        | node_type='gpu-p100'                 |
| updated_at        | None                                 |
| uuid              | 5b39a2b5-9edf-41a1-8c02-ca5a03bc9fe7 |
+-------------------+--------------------------------------+

The important bits here are to set a node_type in the properties, and to add the hosts containing the PCI devices to the aggregate.
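To illustrate how the node_type property steers scheduling, here is a simplified Python model of the matching between flavor extra specs and aggregate properties. It assumes the comparison is done by nova's AggregateInstanceExtraSpecsFilter and reduces it to plain string equality; the real filter is more capable, so this is a sketch of the idea and not the scheduler's implementation.

```python
PREFIX = "aggregate_instance_extra_specs:"


def host_passes(flavor_extra_specs, aggregate_metadata):
    """Simplified model: every scoped flavor key must equal the aggregate's value."""
    for key, wanted in flavor_extra_specs.items():
        if not key.startswith(PREFIX):
            continue  # other extra specs are not considered in this model
        if aggregate_metadata.get(key[len(PREFIX):]) != wanted:
            return False
    return True


if __name__ == "__main__":
    flavor = {"aggregate_instance_extra_specs:node_type": "gpu-p100",
              "pci_passthrough:alias": "p100:1"}
    print(host_passes(flavor, {"node_type": "gpu-p100"}))  # GPU aggregate
    print(host_passes(flavor, {"node_type": "general"}))   # hypothetical other aggregate
    print(host_passes(flavor, {}))                          # host in no aggregate
```

In this model a flavor without a node_type extra spec passes on every host, so keeping ordinary flavors off the GPU hosts relies on those flavors carrying their own node_type value.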

Create a flavor with PCI-e devices attached

Flavors are most easily created using our flavoradmin scripts. For the P100 cards in this example the flavors might look like this:

GPU flavors

[
{
  "Name": "dx2.6c50r.p100",
  "CPU": "6",
  "RAM": "51200",
  "Disk": "40",
  "hw:cpu_cores": 6, "hw:cpu_sockets": 1, "hw:cpu_threads": 1,
  "quota:disk_read_iops_sec": 300, "quota:disk_write_iops_sec": 300,
  "hw_rng:allowed": true, "hw_rng:rate_bytes": 24, "hw_rng:rate_period": 5000,
  "aggregate_instance_extra_specs:node_type": "gpu-p100",
  "pci_passthrough:alias": "p100:1",
  "visibility": "private"
},
{
  "Name": "dx2.12c100r.2p100",
  "CPU": "12",
  "RAM": "102400",
  "Disk": "40",
  "hw:cpu_cores": 12, "hw:cpu_sockets": 2, "hw:cpu_threads": 1,
  "quota:disk_read_iops_sec": 300, "quota:disk_write_iops_sec": 300,
  "hw_rng:allowed": true, "hw_rng:rate_bytes": 24, "hw_rng:rate_period": 5000,
  "aggregate_instance_extra_specs:node_type": "gpu-p100",
  "pci_passthrough:alias": "p100:2",
  "visibility": "private"
}
]
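For readers without access to the flavoradmin scripts, the sketch below turns one such definition into an "openstack flavor create" command line. The mapping of the "CPU", "RAM", "Disk" and "visibility" keys to CLI options is an assumption about what the flavoradmin format means, and every remaining key is assumed to be a flavor property. The function name is hypothetical.

```python
import json
import shlex

# Keys assumed to map to dedicated CLI options; all others become properties.
BASE_KEYS = {"Name", "CPU", "RAM", "Disk", "visibility"}


def flavor_create_args(flavor):
    """Build an `openstack flavor create` argument list from one definition."""
    args = ["openstack", "flavor", "create",
            "--vcpus", str(flavor["CPU"]),
            "--ram", str(flavor["RAM"]),
            "--disk", str(flavor["Disk"])]
    args.append("--private" if flavor.get("visibility") == "private" else "--public")
    for key, value in flavor.items():
        if key in BASE_KEYS:
            continue
        # JSON encoding renders True as "true" and 300 as "300".
        text = value if isinstance(value, str) else json.dumps(value)
        args += ["--property", "{}={}".format(key, text)]
    args.append(flavor["Name"])
    return args


if __name__ == "__main__":
    # Shortened version of the first flavor definition in this article.
    example = {"Name": "dx2.6c50r.p100", "CPU": "6", "RAM": "51200", "Disk": "40",
               "hw_rng:allowed": True,
               "aggregate_instance_extra_specs:node_type": "gpu-p100",
               "pci_passthrough:alias": "p100:1",
               "visibility": "private"}
    print(" ".join(shlex.quote(a) for a in flavor_create_args(example)))
```

The script only prints the command, so it can be inspected before anything is created in the cloud.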

Verify that it works

Create a VM with one of the GPU flavors, and check from inside the VM that it got the PCI device:

PCI device as seen inside the VM

$ lspci | grep NVIDIA
00:05.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
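For the flavors with more than one card it can be worth comparing the number of devices in the VM with the count requested in the flavor. The sketch below does that for lspci output captured inside the VM, assuming the "pci_passthrough:alias" value always has the "name:count" form used in this article. Both function names are hypothetical.

```python
def requested_count(alias_spec, alias_name):
    """Return how many devices a "name:count" list requests for one alias."""
    total = 0
    for item in alias_spec.split(","):
        name, _, count = item.strip().partition(":")
        if name == alias_name:
            total += int(count)
    return total


def nvidia_devices(lspci_output):
    """Return the lspci lines that mention an NVIDIA device."""
    return [line for line in lspci_output.splitlines() if "NVIDIA" in line]


if __name__ == "__main__":
    # The lspci line shown in this article, for the single-GPU flavor.
    lspci_output = ("00:05.0 3D controller: NVIDIA Corporation GP100GL "
                    "[Tesla P100 PCIe 16GB] (rev a1)\n")
    expected = requested_count("p100:1", "p100")
    found = len(nvidia_devices(lspci_output))
    print("expected", expected, "found", found)
```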
