Ongoing incident
Incident description
The Nvidia GRID license server (nvidiadls02.it.ntnu.no), which serves vGPU licenses to GPU-enabled VMs on all of NTNU's OpenStack platforms, was reinstalled without us being notified. The cause was missing documentation on NTNU IT's side: because the server's use was not documented, the engineer believed it was not in use.
Impact
New GPU VMs will not be able to retrieve a license, so their vGPU will not work. Running VMs will lose their license over time, and will also lose it upon a reboot.
Event log
Time | Event |
---|---|
15.03.24 | The server was reinstalled by NTNU IT |
19.03.24 - 13:49 | We discovered that new GPU VMs were no longer able to acquire a license. A few minutes later it became clear that the server had been reinstalled |
19.03.24 - 14:06 | The engineer who was involved in setting this up in June last year was contacted. He confirmed that he had reinstalled the server. |
19.03.24 - 15:36 | The license server has been reconfigured, and is now working again. All running VMs must download a new client configuration token to be able to acquire/renew the license |
Implemented fix
The license server has been reconfigured from scratch. This means that all existing users/running VMs must download a new client configuration token in order to acquire/renew the license. This is done by running the following commands as root inside the VM:
    wget https://rpm.iik.ntnu.no/nvidia/gridd.tok -O /etc/nvidia/ClientConfigToken/gridd.tok
    systemctl restart nvidia-gridd.service

    # Then verify that the nvidia-grid daemon successfully acquired a license
    systemctl status nvidia-gridd.service
    ● nvidia-gridd.service - NVIDIA Grid Daemon
       Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
       Active: active (running) since Tue 2024-03-19 14:33:47 UTC; 9s ago
      Process: 2273 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
     Main PID: 2274 (nvidia-gridd)
        Tasks: 4 (limit: 144861)
       Memory: 1.6M
       CGroup: /system.slice/nvidia-gridd.service
               └─2274 /usr/bin/nvidia-gridd

    Mar 19 14:33:47 DEMO systemd[1]: Starting NVIDIA Grid Daemon...
    Mar 19 14:33:47 DEMO systemd[1]: Started NVIDIA Grid Daemon.
    Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Started (2274)
    Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Configuration parameter ( ServerAddress ) not set
    Mar 19 14:33:47 DEMO nvidia-gridd[2274]: vGPU Software package (0)
    Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Ignore service provider and node-locked licensing
    Mar 19 14:33:47 DEMO nvidia-gridd[2274]: NLS initialized
    Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Acquiring license. (Info: nvidiadls02.it.ntnu.no; NVIDIA Virtual Compute Server)
    Mar 19 14:33:49 DEMO nvidia-gridd[2274]: License acquired successfully. (Info: nvidiadls02.it.ntnu.no, NVIDIA Virtual Compute Server; Expiry: 2024-3-20 14:33:4>
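If you prefer not to read the log by eye, the verification step can be scripted. The following is a minimal sketch, not an official tool: the function name `license_acquired` is hypothetical, and it assumes that the daemon logs the same "License acquired successfully" message shown in the output above.

```shell
#!/bin/sh
# Sketch: decide from nvidia-gridd log text whether a license was acquired.
# Reads log lines on stdin and returns 0 if the success message seen in the
# status output above is present, non-zero otherwise.
# Assumption: the message text matches the log excerpt in this report.
license_acquired() {
    grep -q 'License acquired successfully'
}

# Possible usage inside the VM, as root, after restarting the service:
#   journalctl -u nvidia-gridd.service --no-pager | license_acquired \
#       && echo "license OK" || echo "no license yet"
```

Note that the journal may still contain a success message from before the restart, so check the timestamps or restrict the time range if the result looks suspicious.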