Timestamp        Event
17.05.23 07:23   compute108 rebooted on its own with "CPU machine check error".
22.05.23 19:03   compute308 rebooted on its own with "CPU machine check error".
23.05.23 ~08:00  The unexpected reboots were discovered by the SkyHiGh team. The VMs on both compute nodes were migrated to other hosts, and the nodes were taken out of production.
23.05.23 ~09:00  Contacted Dell ProSupport for assistance.
24.05.23 15:58   Dell suggested a new BIOS for compute108.
25.05.23 09:43   compute108 now has the recommended BIOS, and all VMs have been migrated back. We are now waiting to see whether this actually fixed the problem. Be aware that we may experience another uncontrolled reboot.
02.06.23         compute308 got its motherboard and a CPU replaced. Dell's recommended BIOS was also installed.
05.06.23 08:05   compute308 was put back into production, and all VMs were migrated back.
26.01.24 06:07   compute108 "finally" failed again with "CPU machine check error", and performed an uncontrolled reboot.
28.01.24 11:01   compute308 failed again with the same error, despite having had its motherboard and CPU replaced in June.
30.01.24 ~09:00  Migrated all VMs off compute108 to prepare it for a new motherboard replacement. The replacement will happen on February 2nd.
30.01.24 13:47   Contacted Dell about compute308. Awaiting response. Meanwhile, the node is back in production.
31.01.24 15:25   compute108 got its motherboard replaced and was put back into production. Migrated all VMs back to it.
12.02.24 08:51   Upgraded the BIOS and sent new logs from compute308 to Dell, as requested.
16.02.24 04:05   compute108 failed yet again with "CPU 2 machine check error". Dell has been contacted, and the server will be taken out of production.
19.02.24         Dell decided to replace the CPU in compute108. This will be done on the 22nd of February.
22.02.24 14:20   compute108 got a new CPU. The server is back in production, and all VMs will be migrated back to it.
05.03.24 06:43   compute308 hit another "CPU 2 machine check error", followed by a sudden reboot. The node will be taken out of production. Dell ProSupport has been contacted once again.