Today we experienced authentication issues in both OpenStack platforms at Gjøvik (SkyHiGh and its test environment, SkyLow). We observed that OpenStack keystone, the authentication component of OpenStack, started to report that it was offline.

Users reported issues with accessing both the OpenStack web interface (horizon) and the OpenStack APIs, which makes sense when the authentication service doesn't work as it should.

Failure description

The issues with the authentication service were quickly identified as being related to the communication between keystone and NTNU-AD (win.ntnu.no). We first investigated whether the outage was related to the plans for changing the auth regime for OpenStack's service users. We had talked about that earlier today, so it felt a bit related, since the two platforms using the given service user failed at the same time. After realizing that this was not the case, we investigated further.

Next we realized that keystone could not communicate with the win.ntnu.no servers, specifically not over IPv6. IPv6 traffic was flowing fine from client networks in Gjøvik to SkyHiGh, so the route exchange between the OpenStack core routers and NTNU IT's routers at Gjøvik was confirmed working. Traffic was also flowing fine between client networks in Gjøvik and the win.ntnu.no servers, but it was not flowing between SkyHiGh and any networks in Trondheim (or the Internet). This made us confident that the issue was related to NTNU IT's infrastructure, and not the OpenStack infrastructure, so NTNU's Nettvakt was notified.
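
The kind of check described above, comparing reachability per address family, can be sketched in a few lines of Python. This is only an illustration and not the tooling we actually used during the incident: the helper names `reachable` and `compare_families` are made up for this example, and the port (636, LDAPS) is an assumption about how one might probe a domain controller.

```python
import socket


def reachable(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over `family`.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        # Only ask the resolver for addresses of the requested family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name has no address of this family (or does not resolve).
        return False
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Timeout, unreachable network, refused connection: try the next address.
            continue
    return False


def compare_families(host, port, timeout=3.0):
    """Probe the same host and port over IPv4 and IPv6 separately."""
    return {
        "ipv4": reachable(host, port, socket.AF_INET, timeout),
        "ipv6": reachable(host, port, socket.AF_INET6, timeout),
    }
```

Running something like `compare_families("win.ntnu.no", 636)` from both a client network and an OpenStack node would show whether a failure is specific to one address family and one source network, which is the pattern that pointed us towards a routing problem outside the platform.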

They confirmed that they had just performed a change, and that in connection with it they had also cleaned up some unrelated configuration. That unrelated cleanup turned out to be "load-bearing", and resulted in the OpenStack platform losing IPv6 communication with Trondheim. Since keystone used IPv6 to reach the domain controllers in Trondheim, that affected the platform badly.

The case is now handled by NTNU IT's incident-response regime, and we hope that it does not happen again.

Event Log

Time              Event
14.02.23 - 13:46  First indications that keystone stopped working.
14.02.23 - 13:48  First users asking if we knew about any issues, as they experienced difficulties reaching both the APIs and the web interface.
14.02.23 - 14:13  NTNU Nettvakt was notified that we experienced IPv6 routing issues. They confirmed a very recent related change. They will investigate, and either roll back or implement a work-around.
14.02.23 - 14:31  NTNU Nett reports that they have implemented a work-around.
14.02.23 - 14:32  Systems are confirmed working again. Nettvakt is notified that it is OK again, and they will start their incident-review process.