
Initial failure

This incident started with a notification from our monitoring system that one of our compute nodes (compute310 in SkyHiGh) had crashed with a hardware failure. [1]

The compute node was rebooted, and the VMs hosted on the node were restarted. Previous experience has shown us that VMs survive such crashes just fine and come back up once the compute node is restarted, so the technical fix for this incident was considered done.

A case was created with our hardware vendor to find, and hopefully solve, the problem with the compute node. In the meantime the compute node was left in production to see whether this was a weird, non-recurring error or whether it would happen again.

Storage issues for the affected VMs

At 10:21 a user reported that a disk on one of their VMs was 'broken'. A quick check showed that this was in fact the case for all 60 VMs hosted on compute310. Investigation showed that 'broken' in this case means the disk is in read-only mode, which does not really work well: booted Linux systems report I/O errors on writes, and Windows systems crash to a blue screen. Further investigation showed that this was caused by the following:

  • The disks of the VMs are hosted on a distributed storage cluster (Ceph), so that all compute nodes can technically access the virtual disks. This is good, as it lets us move VMs away from crashed or malfunctioning compute nodes.
  • To avoid having two VMs writing to the same disk, the Ceph cluster employs a locking mechanism: a compute node signals to the storage cluster that only that node should be allowed to write to the image. When the compute node stops the VM, or the VM is moved to a new compute node, the lock is released.
    • When compute310 crashed, it was unable to release its locks...
  • When compute310 rebooted and restarted all the VMs, the existing locks were still in place on these virtual disks, so compute310 could not get write access to them. As a result, all the VMs booted with read-only access to their disks (a sketch of how such dangling locks can be inspected follows this list).
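For illustration, the snippet below is a minimal sketch of how the locks on a single RBD image can be inspected with the Ceph Python bindings (rados/rbd). The pool name "vms" and the image name "volume-example" are hypothetical placeholders, not our actual names.

```python
import rados
import rbd

# Minimal sketch: list the clients holding a lock on one RBD image.
# Pool "vms" and image "volume-example" are hypothetical placeholders.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("vms")
image = rbd.Image(ioctx, "volume-example")

lockers = image.list_lockers()          # empty when no lock is held
if lockers:
    for client, cookie, addr in lockers["lockers"]:
        print(f"lock held by {client} at {addr} (cookie: {cookie})")
else:
    print("no locks on this image")

image.close()
ioctx.close()
cluster.shutdown()
```

A dangling lock shows up here as a locker whose address points at the crashed compute node, even though nothing on that node is actually running anymore.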

When the failure was discovered, all VMs on compute310 were stopped again, all locks on the affected images were removed, and the VMs were restarted. By 11:35 all VMs were up and running, and this time we verified that some of them had actually booted correctly. A sketch of the lock removal step is shown below.
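The sketch below shows what breaking such stale locks can look like with the same Python bindings. It is not our exact tooling, and the pool and image names are again hypothetical placeholders. Locks should only be broken while the VM is stopped, otherwise two writers could end up on the same disk.

```python
import rados
import rbd

def break_stale_locks(pool, image_name, conffile="/etc/ceph/ceph.conf"):
    """Break every lock on an RBD image so a restarted VM can write to it again."""
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            with rbd.Image(ioctx, image_name) as image:
                lockers = image.list_lockers()
                if not lockers:
                    return
                for client, cookie, addr in lockers["lockers"]:
                    print(f"breaking lock held by {client} at {addr}")
                    image.break_lock(client, cookie)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

# Hypothetical example: the VM using this image must be stopped first.
break_stale_locks("vms", "volume-example")
```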


Event log

Timestamp         | Event
12.01.23 ~08:00   | compute310 crashed
12.01.23 10:21    | A user reports a broken disk on a VM affected by the crash
12.01.23 11:35    | Disk errors fixed by removing the locks in Ceph
30.01.23 07:13    | compute310 crashed again
30.01.23 08:42    | All VMs moved to other hosts, and all dangling Ceph locks removed
13.02.23          | Root cause of the dangling locks found, and our configuration corrected accordingly: in the Pacific release, Ceph renamed "osd blacklist" to "osd blocklist", but we had failed to update our permission scheme with the new command. This prevented the compute node from releasing the old lock when it rebooted. (A sketch of the corrected permissions follows this table.)
07.04.23 19:37    | compute310 crashed again (during Easter, of course), and one of the CPUs now seems to be completely dead. That dead CPU is likely to blame for all further crashes.
08.04.23 02:42    | compute310 crashed again
08.04.23 09:37    | compute310 crashed again
Many more times   | compute310 crashed again
11.04.23          | Contacted Dell to replace the CPU, and migrated all VMs to a working node
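For illustration of the permission fix: the Ceph documentation recommends giving RBD clients the 'profile rbd' caps, which include the blocklist permission (the operation renamed from 'blacklist' in the Pacific release). The snippet below is a minimal sketch of such a correction; the client name client.nova and the pool name vms are placeholders, not our actual setup.

```python
import subprocess

# Minimal sketch, not our actual configuration: give an RBD client cephx caps
# that include the 'osd blocklist' permission, so it can evict a dead peer and
# break its stale lock. "client.nova" and pool "vms" are hypothetical.
subprocess.run(
    [
        "ceph", "auth", "caps", "client.nova",
        "mon", "profile rbd",
        "osd", "profile rbd pool=vms",
    ],
    check=True,
)
```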




Footnotes:

  [1] CPU 2 machine check error detected; an unexpected system shutdown operation occurred when collecting the internal error log data.

