
Overview

The servers run Linux and receive security patches continuously. If a reboot is required, it is done during the normal patching of the Linux servers.

Schedule

The schedule is announced on https://varsel.it.ntnu.no/ and is the fourth Wednesday of each month, or the third Wednesday in December.

Reboot order

The OpenStack nodes are rebooted as follows:

Patch day before 16:00

  • Storage nodes. These are rebooted one by one and do not affect OpenStack availability. Two nodes can be down and the Ceph storage is still available, and since Ceph is verified to be fully up and running before the next node is rebooted, this is safe.

Patch day after 16:00

  • Compute nodes. The instances running on a node are shut down before the compute node reboots and are started again once the node is back up. The instances will be unavailable for roughly 10 to 15 minutes while this happens.
  • Infrastructure nodes. There are three infrastructure nodes, each running all the services behind a load balancer. There might be a small delay in network access to the instances when the load balancer changes its target or the active load balancer is taken down.

Patching should be finished before 23:00, but experience shows that it is usually done around 20:00.

Patching procedures

Storage nodes

Log in to a Ceph monitor (cephmon0, 1 or 2) and run the command "watch -n 1 ceph -s". Verify the following:

# health: should be ok
    health: HEALTH_OK

# mon: should be 3 daemons and have quorum
# osd: all OSDs should be up and in; in this example, 50 of 50 are up.
  services:
    mon: 3 daemons, quorum cephmon0,cephmon1,cephmon2
    mgr: cephmon0(active), standbys: cephmon1, cephmon2
    osd: 50 osds: 50 up, 50 in
    rgw: 1 daemon active

  data:
    pools:   10 pools, 880 pgs
    objects: 1.39M objects, 5.59TiB
    usage:   16.8TiB used, 74.2TiB / 91.0TiB avail
    pgs:     878 active+clean
             2   active+clean+scrubbing+deep

  io:
    client:   8.16KiB/s rd, 2.01MiB/s wr, 105op/s rd, 189op/s wr


When everything is OK, reboot the first node and wait for Ceph to return to HEALTH_OK before rebooting the next one.
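A minimal sketch of the loop, assuming passwordless SSH as root to the storage nodes (storage01 to storage05, as used further down) and to cephmon0; adjust the node names to the actual environment:

# Reboot one storage node at a time and wait for Ceph to recover before continuing.
for node in storage01 storage02 storage03 storage04 storage05; do
  ssh $node "reboot"
  sleep 120   # give the node time to go down and come back up
  until ssh cephmon0 "ceph health" | grep -q HEALTH_OK; do
    echo "waiting for ceph to recover after rebooting $node ..."
    sleep 30
  done
done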

Compute nodes

Verify the instances running on the compute node

openstack server list --all --host compute01
+--------------------------------------+--------------------+--------+-----------------------------------------+---------------------------------------------+-----------+
| ID                                   | Name               | Status | Networks                                | Image                                       | Flavor    |
+--------------------------------------+--------------------+--------+-----------------------------------------+---------------------------------------------+-----------+
| 5c32f1d1-2f12-1234-beffe112345ceffe1 | kubertest-master-2 | ACTIVE | kubertest=10.2.0.7, 129.241.152.9       | CoreOS 20190501                             | m1.xlarge |
+--------------------------------------+--------------------+--------+-----------------------------------------+---------------------------------------------+-----------+
  • Check that one or more of the instances have working network (ping). Some instances have security groups that make this check impossible.
  • Check that there is no more than one Kubernetes master per compute node. The masters require quorum, so if two masters from the same cluster end up on one compute node, one of them must be moved before rebooting.

When all is OK, reboot the compute node and verify that it comes back up, that all instances are ACTIVE, and that the instances tested with ping still have network (see the sketch below).

Continue rebooting the compute nodes one at a time, checking each one as described above.
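A minimal sketch of the post-reboot checks, reusing the hypothetical host compute01 and the example instance IP from the listing above:

# Verify the compute service, the instances and the network after the reboot.
openstack compute service list --host compute01   # nova-compute should report state "up"
openstack server list --all --host compute01      # all instances should be ACTIVE
ping -c 3 129.241.152.9                           # spot-check an instance that answered ping before the reboot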

Infra nodes

The infra nodes are rebooted one at a time. The services run on all three nodes behind load balancers, so they remain available while a single node is down (see Reboot order above).

Complete shutdown/power up procedures

Overview

Sometimes the whole stack needs to be shut down, and then it is important to do it in the right order to ensure quorum is maintained.

Turn off monitoring

Monitoring will generate a lot of alarms during shutdown, so it is smart to turn off Sensu first.

for a in $(seq 0 2); do ssh sensu$a halt; done


Compute nodes

Power off all the compute nodes first.
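A minimal sketch, assuming the compute nodes follow the compute01..computeNN naming used above; adjust the upper bound to the actual number of nodes:

# Halt all compute nodes (the upper bound 20 is a placeholder).
for a in $(seq -w 1 20); do ssh compute$a "halt" ; done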

Infra nodes

Step 1: Turn off autostart of VMs

Ensure that no VMs are set to autostart on any of the infra nodes (infra0, 1 and 2):

# List all VMs with autostart enabled; there should be none.
virsh list --all --name --autostart
# If any VM is set to autostart, disable it.
virsh autostart <vm> --disable
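To check all three infra nodes in one go, a small sketch (assuming passwordless SSH as root):

# Should print nothing if no VMs are set to autostart.
for a in $(seq 0 2); do ssh infra$a "virsh list --all --name --autostart" ; done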

Step 2: Turn off VMs/services

1 - Turn off radosgw

for a in $(seq 0 2); do ssh radosgw$a "halt" ; done

2 - Mass shutdown of the other VMs. Ignore the warning about apimon0 and apimon2.

# Friendly one liner
for a in adminlb apimon cache cinder glance heatapi heatengine horizon kanin keystone munin neutronapi novaapi novaservices puppetdb redis sensu servicelb; do for b in $(seq 0 2); do ssh $a$b "halt" ; done ; done

# The same loop broken down for readability
for a in adminlb apimon cache cinder glance heatapi heatengine horizon kanin keystone munin neutronapi novaapi novaservices puppetdb redis sensu servicelb; do
  for b in $(seq 0 2); do
    ssh $a$b "halt"
  done
done

3 - postgres

Find the master. Run "ip addr show" on each node and note the one with two interfaces.

# One-liner to find which node is the master
for a in $(seq 0 2); do [ $(ssh postgres$a ip addr show | grep -c inet) -eq 5 ] && echo postgres$a is master;  done

Note which node is the master. Shut down the non-masters first, then shut down the master (see the sketch below).
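A hedged sketch, assuming the check above reported postgres1 as the master; swap the names to match the actual result:

# Halt the standbys first, then the master.
for a in 0 2; do ssh postgres$a "halt" ; done
ssh postgres1 "halt"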

4 - mysql

5 - Verify that there is no client I/O on the Ceph cluster (check on a cephmon)
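A sketch of the check, run against cephmon0: the io/client line of "ceph -s" should show no remaining client activity before the storage nodes are powered off.

# With all clients halted, the client I/O line should be absent or show next to no activity.
ssh cephmon0 "ceph -s" | grep -A 2 "io:"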

6 - Shut down the storage nodes

# Halt all five storage nodes
for a in $(seq 1 5); do ssh storage0$a "halt" ; done

7 - Shut down the cephmon nodes
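A minimal sketch, following the same pattern as the other services: halt the three Ceph monitors last, once everything else is down.

for a in $(seq 0 2); do ssh cephmon$a "halt" ; done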
