...
Our compute nodes compute108 and compute308 both experienced an uncontrolled reboot, and reported "CPU machine check error". Compute308 has had this issue before, and have previously gotten its motherboard replaced in an attempt to fix the issue. For compute108 this is new, but it is also now the fourth of our identically specced Dell PowerEdge R650s with this symptom.
Event log
Timestamp | Event |
---|---|
17.05.23 07:23 | compute108 rebooted on its own with "CPU machine check error" |
22.05.23 19:03 | compute308 rebooted on its own with "CPU machine check error" |
23.05.23 ~08:00 | The unexpected reboots were discovered by the SkyHiGh team. The VMs of both compute nodes were migrated to other hosts, and taken out of production |
23.05.23 ~09:00 | Contacted Dell pro-support to get assistance |
24.05.23 15:58 | Dell suggests a new BIOS for compute108 |
25.05.23 09:43 | compute108 now has the recommended BIOS, and all VMs has been migrated back. We are now waiting to see if this actually fixed the problem. Be aware that we now may experience a new uncontrolled reboot.. |
02.06.23 | compute308 got its motherboard and a CPU replaced |