Initial failure

Early this morning, we discovered that three of our GPU nodes had lost their network connectivity. The outage was reported by our monitoring system and affected management, storage and VM traffic.

Failure description

Upon further investigation, we saw symptoms that were quite familiar to us. Layer 1 was working, as the LACP links were established, but there was no communication further up in the networking stack. We have seen this behavior before, and it is caused by a rare, known bug in openvswitch. The bug has been fixed in later versions of openvswitch, and we had simply forgotten to upgrade the packages on these three nodes the last time they were reinstalled.

Implemented fix

Openvswitch has been upgraded on the three GPU nodes in question, and the nodes have been rebooted. As a natural side effect, all VMs running on these servers were rebooted as well.

Event log

Time                Event
28.02.23 - 06:11    gpu304 lost network connectivity
28.02.23 - 06:28    gpu302 lost network connectivity
28.02.23 - 06:35    gpu301 lost network connectivity
28.02.23 - 08:00    SkyHiGh operators arrived at work and started working on the issue
28.02.23 - 09:24    All three affected nodes were fixed and returned to production
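
The event log gives enough information to estimate how long each node was without network. The following is a minimal sketch in Python that uses only the timestamps listed above. It assumes the dates are in day.month.year format, and it treats 09:24 as the recovery time for every node, so each figure is an upper bound (an individual node may have been back earlier). The names `lost`, `restored` and `outage_minutes` are made up for this illustration.

```python
from datetime import datetime

# Timestamp format used in the event log (assumed to be day.month.year).
FMT = "%d.%m.%y - %H:%M"

# Times copied from the event log.
lost = {
    "gpu304": "28.02.23 - 06:11",
    "gpu302": "28.02.23 - 06:28",
    "gpu301": "28.02.23 - 06:35",
}
# Time at which all three nodes were back in production.
restored = "28.02.23 - 09:24"


def outage_minutes(lost_at: str, restored_at: str) -> int:
    """Return the whole minutes between loss of connectivity and recovery."""
    delta = datetime.strptime(restored_at, FMT) - datetime.strptime(lost_at, FMT)
    return int(delta.total_seconds() // 60)


# Upper bound on the outage per node, in minutes.
durations = {node: outage_minutes(t, restored) for node, t in lost.items()}

for node, minutes in durations.items():
    print(f"{node}: at most {minutes // 60} h {minutes % 60:02d} min without network")
```

With the times from the log, this gives at most 3 h 13 min for gpu304, 2 h 56 min for gpu302 and 2 h 49 min for gpu301.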