Some PCIe resources, such as GPUs and InfiniBand network cards, are useful to hand over directly to VMs. This article describes the steps needed to configure PCI passthrough so that PCI devices are handed to VMs of certain flavors in the OpenStack cloud.
Enable the IOMMU on the compute node
First, VT-d needs to be enabled in the system's BIOS/UEFI menu. The option may be listed by name, or it may be hidden behind a generic "Enable Virtualization Technologies" setting. Next, enable the IOMMU in Ubuntu by modifying /etc/default/grub to contain:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
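If this change is rolled out by a script instead of by hand, the edit should be idempotent so that re-running it does not duplicate parameters. Below is a minimal Python sketch of that idea. The helper name `ensure_params` is hypothetical, and it only recognises the simple single-line `VAR="..."` form used above.

```python
import re

# Kernel parameters from the step above.
IOMMU_PARAMS = ("intel_iommu=on", "iommu=pt")


def ensure_params(grub_text, variable, params=IOMMU_PARAMS):
    """Return grub_text with every entry of `params` present in VAR="...".

    Only the simple single-line form VAR="..." is recognised. Existing
    parameters are kept in order and missing ones are appended, so running
    the function twice gives the same result. If the variable is not
    defined at all, a new line is appended.
    """
    pattern = re.compile(
        rf'^{re.escape(variable)}="([^"]*)"[ \t]*$', re.MULTILINE
    )
    match = pattern.search(grub_text)
    if match is None:
        # Variable missing: append it on its own line.
        sep = "" if not grub_text or grub_text.endswith("\n") else "\n"
        return f'{grub_text}{sep}{variable}="{" ".join(params)}"\n'
    existing = match.group(1).split()
    merged = existing + [p for p in params if p not in existing]
    new_line = f'{variable}="{" ".join(merged)}"'
    return grub_text[: match.start()] + new_line + grub_text[match.end():]
```

To use it, read /etc/default/grub and pass the text through `ensure_params` once for `GRUB_CMDLINE_LINUX_DEFAULT` and once for `GRUB_CMDLINE_LINUX`. Writing the result back gives the same settings as the snippet above. The boot configuration still has to be regenerated afterwards.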
Then regenerate the initramfs and the GRUB configuration:
# update-initramfs -u
# grub-mkconfig -o /boot/grub/grub.cfg
After a reboot, you should be able to see that the IOMMU is enabled:
# dmesg | grep 'IOMMU enabled'
[ 0.632907] DMAR: IOMMU enabled
[ 0.632954] DMAR: IOMMU enabled
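It can also be worth checking which IOMMU group the GPU ended up in. VFIO hands devices to a VM per IOMMU group, so the host has to give up every device that shares the card's group. The sketch below reads the groups from sysfs. The function names are hypothetical, and it assumes the standard `/sys/kernel/iommu_groups/<n>/devices/<address>/vendor` layout.

```python
from pathlib import Path


def _read_id(path):
    """Return the lowercase contents of a sysfs ID file, or "" if unreadable."""
    try:
        return path.read_text().strip().lower()
    except OSError:
        return ""


def iommu_groups(sysfs_root="/sys/kernel/iommu_groups", vendor_id=None):
    """Return {group_number: [pci_address, ...]} read from sysfs.

    If vendor_id (e.g. "10de") is given, only groups containing at least
    one device from that vendor are returned.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # The directory is missing or empty when the IOMMU is not enabled.
        return result
    wanted = None if vendor_id is None else "0x" + vendor_id.lower()
    for group in sorted(root.iterdir(), key=lambda p: int(p.name)):
        devices_dir = group / "devices"
        if not devices_dir.is_dir():
            continue
        addresses = sorted(d.name for d in devices_dir.iterdir())
        if wanted is not None and not any(
            _read_id(devices_dir / addr / "vendor") == wanted
            for addr in addresses
        ):
            continue
        result[int(group.name)] = addresses
    return result
```

On the GPU node, `iommu_groups(vendor_id="10de")` would list the groups that contain NVIDIA devices. A group that also contains unrelated devices is a hint that passthrough of that card may be problematic.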
Configure OpenStack to know about the PCIe devices
The compute nodes need to know which PCI devices to pass through to the VMs. For simplicity's sake, it is convenient to use aliases instead of PCI vendor/device IDs. So first we create an alias by adding a key to the global hiera:
nova::pci::aliases:
  - name: 'p100'
    vendor_id: '10de'
    product_id: '15f8'
    device_type: 'type-PCI'
    numa_policy: 'preferred'
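For reference, Nova itself reads PCI aliases as JSON objects from the `alias` option in the `[pci]` section of nova.conf. The sketch below shows roughly what the hiera alias above corresponds to. How puppet-nova actually renders it (key order, quoting) is an assumption here, and `render_pci_aliases` is a hypothetical helper.

```python
import json

# The alias from the global hiera above, as a Python structure.
ALIASES = [
    {
        "name": "p100",
        "vendor_id": "10de",
        "product_id": "15f8",
        "device_type": "type-PCI",
        "numa_policy": "preferred",
    }
]


def render_pci_aliases(aliases):
    """Render each alias as a nova.conf `alias = <json>` line under [pci]."""
    lines = ["[pci]"]
    for alias in aliases:
        lines.append("alias = " + json.dumps(alias, sort_keys=True))
    return "\n".join(lines)


print(render_pci_aliases(ALIASES))
```

The alias `name` ("p100") is the part that flavors refer to later, so the vendor/product IDs only have to be written in one place.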
Next, specify which devices to pass through in the node-specific hiera file for the GPU node. The same file also tags the node with the CUSTOM_COMPUTE_GPU trait:
ntnuopenstack::nova::compute::providers:
  - name: "%{::fqdn}"
    traits: [ 'CUSTOM_COMPUTE_GPU' ]
nova::compute::pci::passthrough:
  - vendor_id: '10de'
    product_id: '15f8'
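The `vendor_id` and `product_id` values in both hiera snippets are the card's PCI IDs. `lspci -nn` prints these as a trailing `[vendor:device]` pair on each line. Below is a small parser as a sketch. `pci_ids` and `read_lspci` are hypothetical helpers, and `read_lspci` assumes pciutils is installed on the node.

```python
import re
import subprocess

# Matches "[vendor:device]" pairs as printed by `lspci -nn`, e.g.
# "... GP100GL [Tesla P100 PCIe 16GB] [10de:15f8] (rev a1)".
ID_PAIR = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(lspci_nn_output, vendor_id=None):
    """Return (address, vendor_id, product_id) for each device line."""
    found = []
    for line in lspci_nn_output.splitlines():
        pairs = ID_PAIR.findall(line.lower())
        if not pairs:
            continue
        vendor, product = pairs[-1]  # last pair on the line is vendor:device
        if vendor_id is None or vendor == vendor_id:
            found.append((line.split()[0], vendor, product))
    return found


def read_lspci():
    """Run `lspci -nn` on the host and return its output as text."""
    result = subprocess.run(
        ["lspci", "-nn"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the GPU node, `pci_ids(read_lspci(), vendor_id="10de")` should list the P100 cards together with the product ID to use in hiera.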
Configure host aggregates to aid scheduling
OpenStack itself needs to know how to schedule to a certain machine, and how to avoid scheduling to the wrong one. To help with this, we create host aggregates whose properties set the 'node_type' key to a value that is also reflected in the VM flavors. The scheduler then only places a given flavor on the hosts in the matching host aggregate. For passthrough of the P100 cards, we create a host aggregate that looks like this:
$ openstack aggregate show gpu-p100
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| availability_zone | None                                 |
| created_at        | 2024-01-24T09:29:54.000000           |
| deleted_at        | None                                 |
| hosts             | gpu-b08-01-34                        |
| id                | 6                                    |
| is_deleted        | False                                |
| name              | gpu-p100                             |
| properties        | node_type='gpu-p100'                 |
| updated_at        | None                                 |
| uuid              | 5b39a2b5-9edf-41a1-8c02-ca5a03bc9fe7 |
+-------------------+--------------------------------------+
The important parts here are to set a node_type in the properties, and to add the hosts that contain the PCI devices to the aggregate.
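The scheduling effect can be sketched as a simplified model of the check done by Nova's AggregateInstanceExtraSpecsFilter. This is not Nova's code. The model assumes that filter is enabled in the scheduler. Under it, every `aggregate_instance_extra_specs:` key on the flavor must match a value in the metadata of the aggregates the host belongs to. The real filter also handles unscoped keys and comparison operators, which are left out here, and `host_passes` is a hypothetical name.

```python
PREFIX = "aggregate_instance_extra_specs:"


def host_passes(flavor_extra_specs, host_aggregate_metadata):
    """Simplified model of the aggregate extra-specs check.

    host_aggregate_metadata maps a metadata key to the set of values found
    across all aggregates the host is in. Only exact string matches on
    scoped keys are modelled; other extra specs are ignored.
    """
    for key, wanted in flavor_extra_specs.items():
        if not key.startswith(PREFIX):
            continue  # e.g. hw:cpu_cores or pci_passthrough:alias
        values = host_aggregate_metadata.get(key[len(PREFIX):])
        if not values or wanted not in values:
            return False
    return True
```

The model also shows a limitation. A flavor without the node_type key passes on every host, including the GPU node. The aggregate therefore keeps GPU flavors on GPU hosts, but it does not by itself keep ordinary flavors off them.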
Create a flavor with PCIe devices attached
Flavors are most easily created using our flavoradmin scripts. For the P100 cards in this example, the flavors might look like this:
[
  {
    "Name": "dx2.6c50r.p100",
    "CPU": "6",
    "RAM": "51200",
    "Disk": "40",
    "hw:cpu_cores": 6,
    "hw:cpu_sockets": 1,
    "hw:cpu_threads": 1,
    "quota:disk_read_iops_sec": 300,
    "quota:disk_write_iops_sec": 300,
    "hw_rng:allowed": true,
    "hw_rng:rate_bytes": 24,
    "hw_rng:rate_period": 5000,
    "aggregate_instance_extra_specs:node_type": "gpu-p100",
    "pci_passthrough:alias": "p100:1",
    "visibility": "private"
  },
  {
    "Name": "dx2.12c100r.2p100",
    "CPU": "12",
    "RAM": "102400",
    "Disk": "40",
    "hw:cpu_cores": 12,
    "hw:cpu_sockets": 2,
    "hw:cpu_threads": 1,
    "quota:disk_read_iops_sec": 300,
    "quota:disk_write_iops_sec": 300,
    "hw_rng:allowed": true,
    "hw_rng:rate_bytes": 24,
    "hw_rng:rate_period": 5000,
    "aggregate_instance_extra_specs:node_type": "gpu-p100",
    "pci_passthrough:alias": "p100:2",
    "visibility": "private"
  }
]
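The flavoradmin scripts are not described here, so the following is only a hedged sketch of an equivalent manual route. It turns one entry into a plain `openstack flavor create` call. It assumes that every key other than Name, CPU, RAM, Disk and visibility is meant as a flavor property (extra spec). `load_flavors` and `flavor_create_argv` are hypothetical helpers, not part of flavoradmin.

```python
import json

# Keys that map to dedicated `openstack flavor create` options; all other
# keys are assumed to be flavor properties (extra specs).
BASIC_KEYS = {"Name", "CPU", "RAM", "Disk", "visibility"}


def load_flavors(path):
    """Read a JSON list of flavor entries in the format shown above."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def flavor_create_argv(flavor):
    """Build an `openstack flavor create` argument list for one entry."""
    argv = [
        "openstack", "flavor", "create",
        "--vcpus", str(flavor["CPU"]),
        "--ram", str(flavor["RAM"]),     # MB
        "--disk", str(flavor["Disk"]),   # GB
    ]
    argv.append("--private" if flavor.get("visibility") == "private" else "--public")
    for key, value in flavor.items():
        if key in BASIC_KEYS:
            continue
        if isinstance(value, bool):
            value = str(value).lower()  # render JSON booleans as true/false
        argv += ["--property", f"{key}={value}"]
    argv.append(flavor["Name"])  # positional flavor name goes last
    return argv
```

Each list can be passed to `subprocess.run(argv, check=True)` with admin credentials loaded. A private flavor also needs project access granted before users can see it.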
Verify that it works
Create a VM and check from inside it that it received the PCI device:
$ lspci | grep NVIDIA
00:05.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
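For multi-GPU flavors it is worth checking the count as well as the presence. `pci_passthrough:alias` takes comma-separated `alias:count` pairs, so "p100:2" requests two devices. The sketch below compares that count with the cards visible in the guest. The helper names and the default match pattern (taken from the lspci line above) are assumptions.

```python
import re


def requested_devices(alias_spec, alias_name):
    """Count devices of alias_name requested by a pci_passthrough:alias
    value such as "p100:2" or "p100:1,other:1"."""
    total = 0
    for item in alias_spec.split(","):
        name, count = item.strip().split(":")  # each pair is alias:count
        if name == alias_name:
            total += int(count)
    return total


def visible_devices(lspci_output, pattern=r"NVIDIA Corporation GP100GL"):
    """Count lines of `lspci` output (run inside the VM) matching pattern."""
    regex = re.compile(pattern)
    return sum(1 for line in lspci_output.splitlines() if regex.search(line))
```

In a dx2.12c100r.2p100 VM, `visible_devices(...)` on the guest's `lspci` output should equal `requested_devices("p100:2", "p100")`, which is 2.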