Ongoing incident

Incident description

The Nvidia GRID license server (nvidiadls02.it.ntnu.no), which serves vGPU licenses to GPU-enabled VMs on all of NTNU's OpenStack platforms, was reinstalled without us being notified. The root cause is missing documentation on NTNU IT's side: because the server's role was not documented, the engineer believed it was unused and could be reinstalled without affecting any users.

Impact

New GPU VMs cannot retrieve a license, so their vGPU will not work. Running VMs will lose their license over time, or as soon as they are rebooted.

Event log

Time                Event
15.03.24            The server was reinstalled by NTNU IT.
19.03.24 - 13:49    We discovered that new GPU VMs were no longer able to acquire a license. A few minutes later it became clear that the server had been reinstalled.
19.03.24 - 14:06    The engineer who was involved in setting this up in June last year was contacted. He confirmed that he had reinstalled the server.
19.03.24 - 15:36    The license server was reconfigured and is working again. All running VMs must download a new client configuration token to acquire or renew their license.
19.03.24 - 16:07    All affected users have been informed by email.

Implemented fix

The license server has been reconfigured from scratch. This means that all existing users/running VMs must download a new client configuration token in order to acquire/renew the license. This is done by running the following commands as root inside the VM:


wget https://rpm.iik.ntnu.no/nvidia/gridd.tok -O /etc/nvidia/ClientConfigToken/gridd.tok
systemctl restart nvidia-gridd.service

# Then verify that the nvidia-gridd daemon successfully acquired a license
systemctl status nvidia-gridd.service
● nvidia-gridd.service - NVIDIA Grid Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-gridd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2024-03-19 14:33:47 UTC; 9s ago
    Process: 2273 ExecStart=/usr/bin/nvidia-gridd (code=exited, status=0/SUCCESS)
   Main PID: 2274 (nvidia-gridd)
      Tasks: 4 (limit: 144861)
     Memory: 1.6M
     CGroup: /system.slice/nvidia-gridd.service
             └─2274 /usr/bin/nvidia-gridd

Mar 19 14:33:47 DEMO systemd[1]: Starting NVIDIA Grid Daemon...
Mar 19 14:33:47 DEMO systemd[1]: Started NVIDIA Grid Daemon.
Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Started (2274)
Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Configuration parameter ( ServerAddress  ) not set
Mar 19 14:33:47 DEMO nvidia-gridd[2274]: vGPU Software package (0)
Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Ignore service provider and node-locked licensing
Mar 19 14:33:47 DEMO nvidia-gridd[2274]: NLS initialized
Mar 19 14:33:47 DEMO nvidia-gridd[2274]: Acquiring license. (Info: nvidiadls02.it.ntnu.no; NVIDIA Virtual Compute Server)
Mar 19 14:33:49 DEMO nvidia-gridd[2274]: License acquired successfully. (Info: nvidiadls02.it.ntnu.no, NVIDIA Virtual Compute Server; Expiry: 2024-3-20 14:33:4>
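
The manual check above could also be scripted, for example when several VMs need to be verified. The following is a minimal sketch and not part of the original fix: the function name license_acquired is hypothetical, and it assumes the daemon logs the same "License acquired successfully" message as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original page): reads nvidia-gridd log
# lines on stdin and succeeds (exit status 0) only if a line reports that
# a license was acquired. The matched text is the message shown in the
# status output above; other driver versions may word it differently.
license_acquired() {
    grep -q 'License acquired successfully'
}

# Example use inside a VM, as root, after restarting the service:
#   journalctl -u nvidia-gridd.service --no-pager | license_acquired \
#       && echo "license OK" || echo "no license yet"
```

The function only inspects the text it is given, so it can be pointed at the journal, at saved log files, or at output collected from several VMs. Because the journal keeps older entries, a match may come from a run before the reinstall; limiting the journal to the current boot or to recent entries avoids that.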


