Initial failure

Our compute nodes compute108 and compute308 both experienced an uncontrolled reboot and reported "CPU machine check error". compute308 has had this issue before, and previously had its motherboard replaced in an attempt to fix it. For compute108 this is new, but it is now the fourth of our identically specced Dell PowerEdge R650s to show this symptom.

Event log

Timestamp        Event
17.05.23 07:23   compute108 rebooted on its own with "CPU machine check error"
22.05.23 19:03   compute308 rebooted on its own with "CPU machine check error"
23.05.23 ~08:00  The unexpected reboots were discovered by the SkyHiGh team. The VMs on both compute nodes were migrated to other hosts, and the nodes were taken out of production
23.05.23 ~09:00  Contacted Dell ProSupport for assistance
24.05.23 15:58   Dell suggested a new BIOS for compute108
25.05.23 09:43   compute108 now has the recommended BIOS, and all VMs have been migrated back. We are now waiting to see whether this actually fixed the problem. Be aware that a new uncontrolled reboot may still occur.
02.06.23         compute308 had its motherboard and a CPU replaced