...

Timestamp          Event
12.01.23 ~08:00    compute310 crashed
12.01.23 10:21     A user reported a broken disk on a VM affected by the crash
12.01.23 11:35     Disk errors fixed by removing locks in Ceph
30.01.23 07:13     compute310 crashed again
30.01.23 08:42     All VMs moved to other hosts, and all dangling Ceph locks removed
13.02.23           Root cause for the dangling file locks was found, and we corrected our configuration accordingly. In the Pacific release, Ceph changed osd blacklist to osd blocklist, but we failed to update the permission scheme with the new command. This prevented the compute node from releasing its old file lock when it rebooted.
07.04.23 19:37     compute310 crashed again (during Easter, of course...), and now one of the CPUs seems to be completely dead. That dead CPU is the likely cause of all further crashes.
08.04.23 02:42     compute310 crashed again (during Easter, of course...)
08.04.23 09:37     compute310 crashed again (during Easter, of course...)
MANY MORE TIMES    compute310 crashed again (during Easter, of course...)
11.04              Contacted Dell to replace the CPU, and migrated all VMs to a working node
13.04 13:09        Motherboard has been replaced, and compute310 is back in production. VMs have been migrated back to it.
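
The root cause recorded in the 13.02.23 entry (a permission scheme that still named only the pre-Pacific command) could be caught with a small check like the one below. This is a minimal sketch, not the check we actually ran: the helper names are hypothetical, and it assumes the monitor caps grant the command through clauses of the form allow command "<name>". Caps built on profiles are not covered.

```python
import re

# Hypothetical helper for illustration. The cap string format handled here
# (allow command "<name>") is an assumption, not the actual permission
# scheme from the incident.
OLD_COMMAND = "osd blacklist"  # command name before the Pacific release
NEW_COMMAND = "osd blocklist"  # command name from the Pacific release on


def allowed_commands(mon_caps: str) -> set[str]:
    """Return the command names granted via 'allow command "<name>"' clauses."""
    return set(re.findall(r'allow command "([^"]+)"', mon_caps))


def needs_blocklist_update(mon_caps: str) -> bool:
    """True if the caps grant the old command name but not the new one."""
    commands = allowed_commands(mon_caps)
    return OLD_COMMAND in commands and NEW_COMMAND not in commands


if __name__ == "__main__":
    # Example cap strings, made up for this sketch.
    stale = 'allow r, allow command "osd blacklist"'
    fixed = 'allow r, allow command "osd blacklist", allow command "osd blocklist"'
    print(needs_blocklist_update(stale))
    print(needs_blocklist_update(fixed))
```

Running such a check against every client's caps before a Ceph upgrade would flag clients that can no longer blocklist their own stale sessions, which is the condition that left the file locks dangling after a reboot.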


...