Starting with Nvidia GRID 13.1, the Nvidia vGPU Manager is supported on Ubuntu Server 18.04 and 20.04. This lets us return to a more homogeneous state, where all types of compute nodes run the same operating system. The reinstall involves the following steps:
- Add support for automatic install of the Nvidia driver on Ubuntu with Shiftleader
- Shelve all running VMs on the GPU-node that should be reinstalled
- Re-install the GPU node
- Re-apply the correct puppet role
The following PostInstall fragment for Shiftleader should work:
```
# Install GRID driver / vGPU manager
echo "Installing Nvidia GRID 13.1 LTS vGPU Manager" >> $logfile
rmmod nouveau

# Blacklist nouveau on the kernel command line and regenerate the grub config
sed -i 's/GRUB_CMDLINE_LINUX="[^"]*/& rd.driver.blacklist=nouveau nouveau.modeset=0/' /etc/default/grub
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="[^"]*/& rd.driver.blacklist=nouveau nouveau.modeset=0/' /etc/default/grub
update-grub

# Blacklist nouveau in modprobe as well, and rebuild the initramfs
echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
update-initramfs -u

# Download and install the vGPU manager package
wget http://rpm.iik.ntnu.no/nvidia/nvidia-vgpu-ubuntu-470_470.82_amd64.deb -O /tmp/nvidia-grid.deb
apt -y install /tmp/nvidia-grid.deb
rm /tmp/nvidia-grid.deb
```
In my tests I needed an extra reboot, because not all kernel modules were loaded the first time. To verify, run lsmod and check that the vfio modules are loaded. If not, reboot.
```
lsmod | grep vfio
nvidia_vgpu_vfio       53248  19
vfio_mdev              16384  1
mdev                   24576  2 vfio_mdev,nvidia_vgpu_vfio
```
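If the modules are loaded, the vGPU manager should also have registered the available vGPU (mdev) types in sysfs. A quick sanity check, assuming the standard mdev sysfs layout (the PCI addresses will of course differ per host):

```
# Each physical GPU should expose a set of mdev (vGPU) types.
# An empty result usually means the vGPU manager did not initialize properly.
ls /sys/class/mdev_bus/*/mdev_supported_types
```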
Now it's time for the fun part...
CentOS and Ubuntu behave slightly differently when it comes to hostnames, which causes conflicts in the Openstack Placement service when the reinstalled node tries to start nova-compute. On CentOS, servers get their FQDN as hostname; on Ubuntu they get the short name. So when the GPU node is reinstalled with Ubuntu and starts nova-compute, Openstack Placement sees it as a new node and tries to add it. That obviously fails, since there already is a node in the placement database with the same name/FQDN...
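To illustrate the difference (the hostnames below are made-up examples, not actual node names):

```
# On CentOS the hostname is typically the FQDN:
hostname      # -> gpu01.example.com

# On Ubuntu the hostname is typically the short name, and only -f returns the FQDN:
hostname      # -> gpu01
hostname -f   # -> gpu01.example.com
```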
Luckily, we have shelved all the VMs that were running on the GPU node, so there are no allocated resources tied to the "old" node in placement. Here is what you need to do:
- Stop nova-compute (and puppet, so you're 100% sure nova-compute is not restarted during the process)
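A minimal sketch of that step, assuming nova-compute and the puppet agent run as ordinary systemd services with these names (adjust if your deployment differs):

```
systemctl stop puppet          # stop the puppet agent first, so it cannot restart nova-compute
systemctl stop nova-compute
systemctl status nova-compute  # verify that nova-compute really is stopped
```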
- Delete the old record from the nova service catalog:
```
openstack compute service list
# Note the ID for the table entry that has a FQDN as name
openstack compute service delete <id>
```
- Delete the Open vSwitch agent from the neutron service catalog:
```
openstack network agent list
# Note the UUID of the Open vSwitch agent for the table entry that has a FQDN as name
openstack network agent delete <uuid>
```
- Check that there are no traces of the "old" node in placement:
```
openstack resource provider list
# The deletion from the nova service catalog may already have removed the resource
# providers from placement. If not, delete them manually. Note that the GPU nodes
# have child resource providers for each GPU; delete these before the parent provider.
openstack resource provider delete <uuid>
```
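Depending on the installed osc-placement version, the provider tree of the GPU node can also be listed explicitly, which makes it easier to find the per-GPU child providers. A hedged example (the --in-tree option is only available in reasonably recent osc-placement releases):

```
# List the GPU node's root resource provider together with all of its children
# (one child provider per GPU), so they can be deleted before the root provider.
openstack resource provider list --in-tree <uuid>
```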
- Restart nova-compute
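Once nova-compute (and puppet) is started again, the node should re-register under its new, short hostname. A quick way to verify, assuming the same service names as in the sketch above:

```
systemctl start nova-compute
systemctl start puppet

# The node should now show up with its short hostname, both as a compute service
# and as a resource provider in placement:
openstack compute service list
openstack resource provider list
```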