...
Timestamp | Event |
---|---|
12.01.23 ~08:00 | compute310 crashed |
12.01.23 10:21 | A user reported a broken disk on a VM affected by the crash |
12.01.23 11:35 | Disk errors fixed by removing the locks in Ceph |
30.01.23 07:13 | compute310 crashed again |
30.01.23 08:42 | Moved all VMs to other hosts and removed all dangling Ceph locks |
13.02.23 | Root cause for the dangling file locks was found, and we corrected our configuration accordingly. A change in the Ceph Pacific release stopped the compute node from releasing its old file lock when it rebooted. |
07.04.23 19:37 | compute310 crashed again (during Easter, of course...) and now one of the CPUs seems to be completely dead. That dead CPU is likely to blame for all further crashes |
08.04.23 02:42 | compute310 crashed again (during Easter, of course...) |
08.04.23 09:37 | compute310 crashed again (during Easter, of course...) |
MANY MORE TIMES | compute310 crashed again (during Easter, of course...) |
11.04.23 | Contacted Dell to replace the CPU, and migrated all VMs to a working node |
13.04.23 13:09 | Motherboard has been replaced, and compute310 is back in production. VMs have been migrated back to it |
14.04.23 02:16 | compute310 crashed again with the same error, even though the motherboard was replaced. |
14.04.23 ~09:00 | Migrated all VMs off the node, and reported back to Dell |
14.04.23 14:17 | Got confirmation that the (probably) faulty CPU will be replaced on Monday. |
...