...
The compute-node was rebooted, and the VMs hosted on the node were restarted. Previous experience has shown us that VMs survive such crashes just fine and boot when the compute-node is restarted, so the technical fix for this incident was considered done.
A case was created with our hardware vendor to find, and hopefully solve, the problems with the compute-node, and we expected the VMs to be fine. In the meantime, the compute-node was left in production to see whether this was a weird non-recurring error or whether it would happen again.
Storage issues for the affected VMs
...
When the failure was discovered, all VMs on compute310 were stopped again, all locks on the affected images were removed, and the VMs were restarted. By 11:35 all VMs were up and running. This time we verified that some of them actually booted correctly.
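The lock cleanup described above can be sketched as a small helper that turns the output of `rbd lock ls <image-spec> --format json` into the matching `rbd lock rm` commands. This is only an illustration, not the procedure we actually ran. It assumes the JSON is a list of objects with `id`, `locker` and `address` keys and that addresses look like `<ip>:<port>/<nonce>`; both should be checked against your Ceph version. The image name, client names and IP addresses below are hypothetical.

```python
import json
import shlex


def build_lock_rm_commands(image_spec, lock_ls_json, crashed_host_ip):
    """Return one `rbd lock rm` argv list per lock held from crashed_host_ip.

    lock_ls_json is the text printed by `rbd lock ls <image_spec> --format json`.
    Assumed shape (verify on your cluster): a list of objects with the keys
    "id", "locker" and "address".
    """
    commands = []
    for lock in json.loads(lock_ls_json):
        # Assumed address shape: "<ip>:<port>/<nonce>". Only locks taken
        # from the crashed host are selected; all others are left alone.
        if lock["address"].startswith(crashed_host_ip + ":"):
            commands.append(
                ["rbd", "lock", "rm", image_spec, lock["id"], lock["locker"]]
            )
    return commands


# Hypothetical sample data, not taken from the incident.
sample = json.dumps([
    {"id": "auto 140001", "locker": "client.4485", "address": "10.0.0.31:0/1234"},
    {"id": "auto 140002", "locker": "client.5120", "address": "10.0.0.42:0/5678"},
])

for argv in build_lock_rm_commands("volumes/volume-example", sample, "10.0.0.31"):
    # Print the commands for review instead of running them; shlex.join
    # quotes the lock ID, which contains a space.
    print(shlex.join(argv))
```

The helper only builds the commands and prints them, so an operator can confirm that the VM really is stopped before removing a lock by hand.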
Event log
Timestamp | Event |
---|---|
12.01.23 ~08:00 | compute310 crashed |
12.01.23 10:21 | A user reported a broken disk on a VM affected by the crash |
12.01.23 11:35 | Disk errors fixed by removing locks in ceph |
30.01.23 07:13 | compute310 crashed again |
30.01.23 08:42 | Moved all VMs to other hosts and removed all dangling Ceph locks |
13.02.23 | Root cause for the dangling file locks was found, and we corrected our configuration accordingly. In the Pacific release, Ceph changed [...]. This stopped the compute-node from releasing the old file lock when it rebooted. |
07.04.23 19:37 | compute310 crashed again (during Easter of course..) and now one of the CPUs seems to be completely dead. That dead CPU is likely to blame for all further crashes |
08.04.23 02:42 | compute310 crashed again (during Easter of course..) |
08.04.23 09:37 | compute310 crashed again (during Easter of course..) |
MANY MORE TIMES | compute310 crashed again (during Easter of course..) |
11.04 | Contacted Dell to replace the CPU, and migrated all VMs to a working node |
13.04 13:09 | Motherboard has been replaced, and compute310 is back in production. VMs have been migrated back to it |
14.04 02:16 | compute310 crashed again with the same error, even though the motherboard was replaced. |
14.04 ~09:00 | Migrated all VMs off the node, and reported back to Dell |
14.04 14:17 | Got confirmation that the (probably) faulty CPU will be replaced on Monday. |