The compute-node was rebooted, and the VMs hosted on the node were restarted. Previous experience has shown us that VMs survive such crashes just fine and boot normally once the compute-node is restarted, so the technical fix for this incident was considered done.

A case was created with our hardware vendor to find, and hopefully solve, the problems with the compute-node, and we expected the VMs to be fine. In the meantime the compute-node was left in production to see whether this was a weird non-recurring error or whether it would happen again.

Storage issues for the affected VMs

...

When the failure was discovered, all VMs on compute310 were stopped again, all locks on the affected images were removed, and the VMs were restarted. By 11:35 all VMs were up and running. This time we verified that some of them had actually booted correctly.
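The report does not say which commands were used to remove the locks. As a rough sketch, assuming the affected images are RBD images and that the locks are cleared with the `rbd lock remove` CLI, the cleanup could be prepared like this. The image spec, lock id, and locker below are hypothetical placeholders; real values would come from `rbd lock list` for each image. The helper only builds the command lines and does not execute anything.

```python
import shlex


def build_lock_removal_commands(image_spec, locks):
    """Build one `rbd lock remove` argv per lock; nothing is executed here.

    image_spec: "<pool>/<image>" of the affected image (hypothetical value).
    locks: list of dicts with the "id" and "locker" reported for that image.
    """
    commands = []
    for lock in locks:
        commands.append(
            ["rbd", "lock", "remove", image_spec, lock["id"], lock["locker"]]
        )
    return commands


# Hypothetical example values, for illustration only.
example_locks = [{"id": "auto 1234567890", "locker": "client.4321"}]
for argv in build_lock_removal_commands("volumes/volume-example", example_locks):
    # Print a shell-safe command line for an operator to review before running.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands instead of running them keeps a human in the loop, which seems prudent when removing locks on images that VMs may still have open.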

Event log

Timestamp	Event
12.01.23 ~08:00	compute310 crashed
12.01.23 10:21	User is reporting a broken disk on a VM affected by the crash
12.01.23 11:35	Disk errors fixed by removing locks in ceph
30.01.23 07:13	compute310 crashed again
30.01.23 08:42	All VMs moved to other hosts, and all dangling ceph locks removed
13.02.23	Root cause for the dangling file locks was found, and we corrected our configuration accordingly. In the pacific release, ceph renamed `osd blacklist` to `osd blocklist`, but we failed to update the permission scheme with the new command. This prevented the compute-node from releasing the old file lock when it rebooted.
07.04.23 19:37	compute310 crashed again (during easter of course..) and now one of the CPUs seems to be completely dead. That dead CPU is likely to blame for all further crashes
08.04.23 02:42	compute310 crashed again (during easter of course..)
08.04.23 09:37	compute310 crashed again (during easter of course..)
MANY MORE TIMES	compute310 crashed again (during easter of course..)
11.04.23	Contacted Dell to replace the CPU, and migrated all VMs to a working node
13.04.23 13:09	Motherboard has been replaced, and compute310 is back in production. VMs have been migrated back to it
14.04.23 02:16	compute310 crashed again with the same error, even though the motherboard had been replaced.
14.04.23 ~09:00	Migrated off all VMs, and reported back to Dell
14.04.23 14:17	Got confirmation that the (probably) faulty CPU will be replaced on Monday.
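
The 13.02.23 entry describes a permission scheme that still granted the old `osd blacklist` command but not the renamed `osd blocklist`. A minimal sketch of a check for that situation is shown here, assuming the client's mon caps are written as explicit `allow command "..."` grants. The caps strings are illustrative examples, not our actual configuration, and the function names are hypothetical.

```python
import re


def allowed_commands(mon_caps):
    """Return the command names explicitly granted via `allow command "..."`."""
    return set(re.findall(r'allow\s+command\s+"([^"]+)"', mon_caps))


def missing_blocklist_grant(mon_caps):
    """True if the caps grant the old command name but not the renamed one."""
    commands = allowed_commands(mon_caps)
    return "osd blacklist" in commands and "osd blocklist" not in commands


# Illustrative caps strings (assumptions, not taken from our cluster).
old_caps = 'allow r, allow command "osd blacklist"'
fixed_caps = 'allow r, allow command "osd blacklist", allow command "osd blocklist"'

print(missing_blocklist_grant(old_caps))
print(missing_blocklist_grant(fixed_caps))
```

A check like this only covers explicit command grants; caps based on a profile (for example `profile rbd`) are not inspected and are reported as not missing.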
