...

The compute node was rebooted, and the VMs hosted on the node were restarted. Previous experience has shown us that VMs survive such crashes just fine and boot when the compute node is restarted, so the technical fix for this incident was considered done.

A case was created with our hardware vendor to find, and hopefully solve, the problems with the compute node, and we expected the VMs to be fine. In the meantime the compute node was left in production to see whether this was a weird one-off error or whether it would happen again.

Storage issues for the affected VMs

...

When the failure was discovered, all VMs on compute310 were stopped again, all locks on the affected images were removed, and the VMs were restarted. By 11:35 all VMs were up and running. This time we verified that some of them had actually booted correctly.
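
The lock cleanup described above can be sketched roughly as follows. This is an illustration under assumptions, not the procedure that was actually run: it only builds "rbd lock rm" command lines from lock entries that an operator has already listed with "rbd lock ls", and executes nothing. The pool, image, lock ID and locker values are hypothetical.

```python
import shlex
from typing import List, NamedTuple


class RbdLock(NamedTuple):
    """One lock entry, as listed by `rbd lock ls <pool>/<image>`."""
    image_spec: str  # hypothetical, e.g. "volumes/volume-1234"
    lock_id: str     # hypothetical, e.g. "auto 139643345791728"
    locker: str      # hypothetical, e.g. "client.4485"


def lock_rm_command(lock: RbdLock) -> List[str]:
    """Build the argv for removing a single lock; nothing is executed here."""
    return ["rbd", "lock", "rm", lock.image_spec, lock.lock_id, lock.locker]


if __name__ == "__main__":
    stale_locks = [
        RbdLock("volumes/volume-1234", "auto 139643345791728", "client.4485"),
    ]
    for lock in stale_locks:
        # Print a shell-quoted command for review instead of running it.
        print(shlex.join(lock_rm_command(lock)))
```

Printing the commands rather than running them keeps a manual review step in place, since removing a lock that is still held by a running VM could corrupt the image.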


Event log

Timestamp         Event
12.01.23 ~08:00   compute310 crashed
12.01.23 10:21    A user reported a broken disk on a VM affected by the crash
12.01.23 11:35    Disk errors fixed by removing locks in Ceph
30.01.23 07:13    compute310 crashed again
30.01.23 08:42    All VMs moved to other hosts, and all dangling Ceph locks removed
13.02.23          Root cause for the dangling file locks was found, and we corrected our configuration accordingly. In the Pacific release, Ceph renamed "osd blacklist" to "osd blocklist", but we failed to update the permission scheme with the new command. This prevented the compute node from releasing its old file locks when it rebooted.
07.04.23 19:37    compute310 crashed again (during Easter, of course...), and now one of the CPUs seems to be completely dead. That dead CPU is the likely cause of all further crashes
08.04.23 02:42    compute310 crashed again (during Easter, of course...)
08.04.23 09:37    compute310 crashed again (during Easter, of course...)
Many more times   compute310 crashed again (during Easter, of course...)
11.04.23          Contacted Dell to replace the CPU, and migrated all VMs to a working node
13.04.23 13:09    Motherboard has been replaced, and compute310 is back in production. VMs have been migrated back to it
14.04.23 02:16    compute310 crashed again with the same error, even though the motherboard was replaced
14.04.23 ~09:00   Migrated all VMs off the node, and reported back to Dell
14.04.23 14:17    Got confirmation that the (probably) faulty CPU will be replaced on Monday
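
The root cause noted in the 13.02.23 entry could be checked for with something like the sketch below. It assumes the permission scheme is a cephx monitor capability string that grants individual commands through allow command "..." clauses; the report does not state the exact caps in use, so the example strings are hypothetical. The function flags caps that still grant the pre-Pacific command name but not the renamed one.

```python
import re
from typing import Set

LEGACY_COMMAND = "osd blacklist"   # command name before the Pacific release
CURRENT_COMMAND = "osd blocklist"  # command name from Pacific onwards


def allowed_commands(mon_caps: str) -> Set[str]:
    """Return the command names granted by `allow command "..."` clauses."""
    return set(re.findall(r'allow\s+command\s+"([^"]+)"', mon_caps))


def missing_blocklist_permission(mon_caps: str) -> bool:
    """True if the caps grant the legacy command name but not the new one."""
    commands = allowed_commands(mon_caps)
    return LEGACY_COMMAND in commands and CURRENT_COMMAND not in commands


if __name__ == "__main__":
    # Hypothetical caps strings, for illustration only.
    before_fix = 'allow r, allow command "osd blacklist"'
    after_fix = 'allow r, allow command "osd blacklist", allow command "osd blocklist"'
    print(missing_blocklist_permission(before_fix))
    print(missing_blocklist_permission(after_fix))
```

Caps that rely on a profile instead of explicit command clauses are not inspected by this check and are reported as not missing the permission.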
