| Time | Event |
|---|---|
| 16.12.23 - 20:22 | The broken floor tile was discovered. |
| 16.12.23 - ca. 23:00 | Agreed that we should remove some weight from R3 and contact help on Monday. |
| 18.12.23 - ca. 09:00 | Contacted a company that will assess the damage and come up with a plan to fix the floor. |
| 18.12.23 - ca. 12:00 | Placed a steel beam under R3 to support it. Migrated all VMs from five of the compute nodes in R3 and removed those nodes from the rack, meaning we are currently running on reduced capacity. |
| 20.12.23 | Visit from the carpenter company. Made an initial plan for what needed to be done. Decided that the steel beam would suffice for support. Little to no further "sinking" was measured. |
| 17.01.24 - 10:00 | Meeting with the carpenter. A plan was made for repairing the floor. They will build support framing between all the floor tile legs and replace all necessary tiles. The carpenters will need two days, and we have scheduled two days for removing all servers and one day for putting everything back after the floor is fixed, giving a total downtime of five days. |
| 18.01.24 - 13:00 | Received confirmation from the carpenter that they can start the repair work on 7th of February. We accepted the offer. |
| 23.01.24 - 15:23 | Messaged all users about the planned downtime in week 6. |
| 05.02.24 - ca. 10:00 | Shut down SkyHiGh, SkyLow and everything else. All servers have been removed from the racks. |
| 08.02.24 - ca. 11:00 | Carpenter work finished. |
| 08.02.24 - ca. 12:00 | Started moving the racks back in place and rewiring fibre cables and environmental sensors. |
| 08.02.24 - 16:00 | All network infrastructure (core and rack switches) is reinstalled and confirmed working as normal. Replaced a broken PDU, and all environmental sensors are confirmed working. |
| 09.02.24 - 08:30 | Started to place all SkyHiGh, NBL, DSE and Hansken servers back in their racks. |
| 09.02.24 - 12:15 | Started to restart infrastructure services in SkyHiGh: databases, Puppet infrastructure, message queues, Ceph cluster etc. |
| 09.02.24 - 14:45 | OpenStack control plane restarted. No VMs running yet. Be patient... 🙂 |
| 09.02.24 - 16:00 | All SkyHiGh nodes are up (one GPU node missing). All VMs that were running when we shut down should be restarted. |
| 09.02.24 - 16:45 | Last GPU node online. |
| 09.02.24 - 17:11 | All users have been notified that SkyHiGh is back online! 😁 |

Email sent to all users

Hei alle dere som benytter SkyHiGh eller andre servere i K001 på
Gjøvik!

Det har dessverre vist seg å være litt dårlig støtte for skyhigh-
rackene i K001, så gulvet de rackene står på holder på å gi etter, og
noen av rackene er dermed blitt litt skeive. Dette er litt uheldig, og
vi har et behov for å rette litt på denne situasjonen. Vi ser oss
derfor nødt til å fikse gulvet; samt forsterke litt for å unngå at
dette skal skje igjen. Vi har en avtale med snekkere om at de skal
fikse og utbedre, men for at de skal kunne gjøre jobben sin er vi
dessverre nødt til å tømme serverrommet helt for servere.

I praksis betyr dette at alle servertjenester som leveres fra K001 vil
være stoppet i hele uke 6. Vi kommer til å skru av og ta ut servere 5.
og 6. februar, la snekkerene jobbe 7. og 8. februar, og deretter sette
ting tilbake i drift 9.(skyhigh) og 12.(resten) februar. De av dere som
benytter noen av disse serverne for å levere en tjeneste til andre er
selv ansvarlige for å varsle om at tjenestene kommer til å gå ned.

Vi beklager ulempene dette medfører, men må samtidig be om forståelse
for at dette er en ekstraordinær situasjon som faktisk bare _må_
utbedres.

===

Hi all SkyHiGh users, and others using servers in K001 at Gjøvik!

Unfortunately, it has been discovered that there is inadequate support
for the SkyHiGh racks in K001. As a result, a few of the floor tiles
beneath these racks have broken, causing some of the racks to tilt. This
is an unfortunate situation, and we need to address it promptly.
Therefore, we find ourselves compelled to fix the floor and reinforce
it to prevent a recurrence. We have an agreement with carpenters to
carry out the necessary repairs and improvements. However, to enable
them to perform their work, we regret to inform you that we need to
completely empty the server room of all servers.

In practical terms, this means that all server services provided from
K001 will be halted throughout week 6. We will power off and remove
servers on February 5th and 6th, allow the carpenters to work on
February 7th and 8th, and then restore operations on February 9th (for
SkyHiGh) and February 12th (for the rest). Those of you utilizing
these servers to deliver services to others are responsible for
notifying them that the services will be temporarily disrupted.

We apologize for any inconvenience this may cause, but we must request
your understanding as this is an extraordinary situation that truly
must be addressed.
