Initial failure

Early this morning, our monitoring system reported that three of our GPU nodes had lost network connectivity. The outage affected management, storage, and VM traffic.

Failure description

Upon further investigation, we saw symptoms that were quite familiar to us. Layer 1 was working, as the LACP links were established, but there was no communication further up in the networking stack. We have seen this behavior before; it is caused by a rare, known bug in openvswitch. The bug has been fixed in later versions of openvswitch, and we had simply forgotten to upgrade the packages on these three nodes the last time they were reinstalled.

Implemented fix

Openvswitch has been upgraded on the three GPU nodes in question, and the nodes have been rebooted. As a natural side effect, all VMs running on these servers were rebooted as well.
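Since the root cause was a forgotten package upgrade, a fleet-wide version check could catch the same drift on other nodes. The following is only a minimal sketch, not something we run today. It assumes each node's version text has already been collected and looks like the output of `ovs-vsctl --version`. The node names, the version numbers, and the `MINIMUM_FIXED` threshold are all hypothetical, as this report does not state which release contains the fix.

```python
import re


def parse_ovs_version(version_output: str) -> tuple[int, ...]:
    """Extract the dotted version from text such as
    'ovs-vsctl (Open vSwitch) 2.17.0' (assumed output format)."""
    match = re.search(r"\(Open vSwitch\)\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError(f"no Open vSwitch version found in {version_output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def nodes_needing_upgrade(reported: dict[str, str],
                          minimum: tuple[int, ...]) -> list[str]:
    """Return the nodes whose reported version is older than `minimum`."""
    return sorted(
        node for node, output in reported.items()
        if parse_ovs_version(output) < minimum
    )


# Hypothetical inventory: node names and versions are made up for illustration.
reported = {
    "node-a": "ovs-vsctl (Open vSwitch) 2.13.1",
    "node-b": "ovs-vsctl (Open vSwitch) 2.17.0",
}

# Hypothetical threshold: the fixed release is not named in this report.
MINIMUM_FIXED = (2, 17, 0)

print(nodes_needing_upgrade(reported, MINIMUM_FIXED))
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.9" would sort after "2.17".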

Event log

Time	Event
28.02.23 - 06:11	gpu304 lost network connectivity
28.02.23 - 06:28	gpu302 lost network connectivity
28.02.23 - 06:35	gpu301 lost network connectivity
28.02.23 - 08:00	SkyHiGh operators arrived at work and started working on the issue
28.02.23 - 09:24	All three affected nodes were fixed and returned to production
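The per-node outage duration follows directly from the timestamps in the event log. The sketch below does that arithmetic with the times copied from the table. The log only records when all three nodes were back in production, so each figure is an upper bound: an individual node may have recovered earlier than 09:24.

```python
from datetime import datetime

# Timestamp format used in the event log, e.g. "28.02.23 - 06:11".
FMT = "%d.%m.%y - %H:%M"

# Times copied from the event log above.
lost = {
    "gpu304": "28.02.23 - 06:11",
    "gpu302": "28.02.23 - 06:28",
    "gpu301": "28.02.23 - 06:35",
}
restored = "28.02.23 - 09:24"


def outage_minutes(lost_at: str, restored_at: str) -> int:
    """Whole minutes between losing connectivity and returning to production."""
    delta = datetime.strptime(restored_at, FMT) - datetime.strptime(lost_at, FMT)
    return int(delta.total_seconds() // 60)


durations = {node: outage_minutes(ts, restored) for node, ts in lost.items()}

for node, minutes in sorted(durations.items()):
    print(f"{node}: at most {minutes // 60}h {minutes % 60:02d}m without network")
```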
