Initial failure
Early this morning, we discovered that three of our GPU nodes had lost network connectivity. Our monitoring system reported the failure, which affected management, storage and VM traffic.
Failure description
Upon further investigation, we saw symptoms that were quite familiar to us. Layer 1 was working, as the LACP links were established, but there was no communication further up the networking stack. We have seen this behavior before; it is caused by a rare, known bug in openvswitch. The bug has been fixed in later versions of openvswitch, and we had simply forgotten to upgrade the packages on these three nodes the last time they were reinstalled.
Implemented fix
Openvswitch has been upgraded on the three GPU nodes in question, and the nodes have been rebooted. As a natural side effect, all VMs running on these servers were rebooted as well.
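Since the root cause was a missed package upgrade after a reinstall, a small check could flag nodes that are still below the fixed version. The following is only a minimal sketch, not the tooling we actually run: the version numbers, the extra node name `gpu305` and the `installed` inventory are hypothetical placeholders, and it assumes plain dotted numeric version strings (distribution suffixes such as `-1ubuntu1` would need to be stripped first).

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '2.17.2' into (2, 17, 2)."""
    return tuple(int(part) for part in version.split("."))


def outdated_nodes(installed: dict[str, str], minimum: str) -> list[str]:
    """Return the nodes whose installed version is lower than `minimum`."""
    floor = parse_version(minimum)
    return sorted(
        node for node, version in installed.items()
        if parse_version(version) < floor
    )


# Hypothetical inventory: node name -> installed openvswitch version.
# These version numbers are illustrative only, not the real ones.
installed = {
    "gpu301": "2.13.0",
    "gpu302": "2.13.0",
    "gpu304": "2.13.0",
    "gpu305": "2.17.2",
}

# Hypothetical minimum version that contains the fix.
print(outdated_nodes(installed, minimum="2.17.0"))
```

In practice the inventory would be collected from the nodes themselves, for example by the configuration management system, and the check run after every reinstall.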
Event log
Time | Event |
---|---|
28.02.23 - 06:11 | gpu304 lost network connectivity |
28.02.23 - 06:28 | gpu302 lost network connectivity |
28.02.23 - 06:35 | gpu301 lost network connectivity |
28.02.23 - 08:00 | SkyHiGh operators arrived at work and started working on the issue |
28.02.23 - 09:24 | All three affected nodes were fixed and returned to production |
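
The downtime per node can be derived from the timestamps in the event log. The sketch below does that arithmetic. It assumes every node counts as restored at 09:24, the time by which all three were back in production, so the figures are upper bounds. The names `lost_at`, `restored_at` and `downtime` are introduced here purely for illustration.

```python
from datetime import datetime, timedelta

# Timestamp format used in the event log, e.g. "28.02.23 - 06:11".
LOG_FORMAT = "%d.%m.%y - %H:%M"

# Times at which each node lost connectivity, taken from the event log.
lost_at = {
    "gpu304": "28.02.23 - 06:11",
    "gpu302": "28.02.23 - 06:28",
    "gpu301": "28.02.23 - 06:35",
}

# Assumption: all nodes are treated as restored at the final log entry.
restored_at = "28.02.23 - 09:24"


def downtime(lost: str, restored: str = restored_at) -> timedelta:
    """Return the time elapsed between two event log timestamps."""
    return datetime.strptime(restored, LOG_FORMAT) - datetime.strptime(lost, LOG_FORMAT)


for node, lost in lost_at.items():
    print(node, downtime(lost))
# gpu304 3:13:00
# gpu302 2:56:00
# gpu301 2:49:00
```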