...
The servers run Linux and receive security patches continuously. If a reboot is needed, we reboot during the normal patching of Linux servers.
Schedule
The schedule is announced on https://varsel.it.ntnu.no/. Patching takes place on the fourth Wednesday of each month, or on the third Wednesday in December.
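The rule above can be turned into a concrete date with a small helper. This is a minimal sketch that assumes GNU date (for the -d option); the function names nth_wednesday and patch_day are made up for illustration, and the announcement on varsel.it.ntnu.no remains the authoritative source.

```shell
#!/usr/bin/env bash
# Sketch: compute the patch day from the rule "fourth Wednesday, third in December".
# Assumes GNU date; the function names are hypothetical.

# Print the date (YYYY-MM-DD) of the n-th Wednesday of a month.
nth_wednesday() {
  local year=$((10#$1)) month=$((10#$2)) n=$3
  local first dow offset day
  first=$(printf '%04d-%02d-01' "$year" "$month")
  dow=$(date -d "$first" +%u)          # 1 = Monday ... 7 = Sunday
  offset=$(( (3 - dow + 7) % 7 ))      # days from the 1st to the first Wednesday
  day=$(( 1 + offset + 7 * (n - 1) ))
  printf '%04d-%02d-%02d\n' "$year" "$month" "$day"
}

# Print the patch day for a given year and month.
patch_day() {
  local month=$((10#$2))
  if [ "$month" -eq 12 ]; then
    nth_wednesday "$1" "$month" 3
  else
    nth_wednesday "$1" "$month" 4
  fi
}
```

Usage: patch_day 2019 6 prints 2019-06-26, and patch_day 2019 12 prints 2019-12-18.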
Reboot order
The OpenStack nodes are rebooted in the following order:
...
Patching should be finished before 23:00, but experience shows that it is usually done by around 20:00.
Patching procedures
Storage nodes
Log in to a Ceph monitor (cephmon0, 1 or 2) and run the command "watch -n 1 ceph -s". Verify the following:
Code Block |
---|
# health: should be ok
health: HEALTH_OK
# mon: should be 3 daemons and have quorum
# osd: all should be up, as of this example 50 of 50 are up.
services:
mon: 3 daemons, quorum cephmon0,cephmon1,cephmon2
mgr: cephmon0(active), standbys: cephmon1, cephmon2
osd: 50 osds: 50 up, 50 in
rgw: 1 daemon active
data:
pools: 10 pools, 880 pgs
objects: 1.39M objects, 5.59TiB
usage: 16.8TiB used, 74.2TiB / 91.0TiB avail
pgs: 878 active+clean
2 active+clean+scrubbing+deep
io:
client: 8.16KiB/s rd, 2.01MiB/s wr, 105op/s rd, 189op/s wr
|
When everything is OK, reboot the first node and wait for Ceph to report HEALTH_OK again before rebooting the next one.
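Waiting for Ceph between reboots can be scripted instead of watched by hand. The following is a rough sketch, not an established part of this procedure: the function name and the default timeout and polling interval are assumptions. It relies only on "ceph health", which prints HEALTH_OK when the cluster is healthy.

```shell
#!/usr/bin/env bash
# Sketch: block until `ceph health` reports HEALTH_OK, or give up after a timeout.
# Function name and default timings are assumptions.
wait_for_ceph_ok() {
  local timeout=${1:-1800} interval=${2:-10} waited=0
  until ceph health 2>/dev/null | grep -q '^HEALTH_OK'; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "ceph not HEALTH_OK after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

Run it on a cephmon after each storage node has come back up, for example: wait_for_ceph_ok && echo "safe to reboot the next node".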
Compute nodes
Verify the instances running on the compute node
Code Block |
---|
openstack server list --all --host compute01
+--------------------------------------+--------------------+--------+-----------------------------------------+---------------------------------------------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------------+--------+-----------------------------------------+---------------------------------------------+-----------+
| 5c32f1d1-2f12-1234-beffe112345ceffe1 | kubertest-master-2 | ACTIVE | kubertest=10.2.0.7, 129.241.152.9 | CoreOS 20190501 | m1.xlarge |
+--------------------------------------+--------------------+--------+-----------------------------------------+---------------------------------------------+-----------+ |
- Check that one or more of the instances have working network (ping). Some instances have security groups that make this check impossible.
- Check that there is no more than one kube master from the same cluster on the compute node. The masters require quorum, so if two masters from the same cluster run on one compute node, one of them must be moved first.
Reboot the compute node when everything is OK. Verify that the compute node comes up, that all instances are active, and that the instances tested with ping still have network.
Continue rebooting one node at a time, checking each compute node as described above.
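The kube master check can be done by eye, but a small filter makes it harder to miss. This sketch assumes masters are named <cluster>-master-<n>, as kubertest-master-2 is in the listing above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read instance names on stdin (one per line) and print every cluster
# that has more than one master in the list.
# Assumes masters are named <cluster>-master-<n>.
clusters_with_multiple_masters() {
  sed -n 's/-master-[0-9][0-9]*$//p' | sort | uniq -d
}
```

Usage: openstack server list --all --host compute01 -f value -c Name | clusters_with_multiple_masters. No output means that no cluster has two masters on that compute node.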
Infra nodes
Complete shutdown/power up procedures.
Overview
Sometimes the whole stack must be shut down. It is then important to do it in the right order to ensure that quorum is maintained.
Turn off monitoring
Monitoring will raise a lot of alarms during the shutdown, so it can be smart to turn off Sensu first.
Code Block |
---|
for a in $(seq 0 2); do ssh sensu$a halt; done |
Compute nodes
Power off all the compute nodes first.
Infra nodes
Step 1 : Turn off autostart of VMs
Ensure that no VMs are set to autostart on any infra node (infra0, 1 and 2)
Code Block |
---|
# List all vm's with autostart, should be none.
virsh list --all --name --autostart
# If any instances are set with autostart, disable it.
virsh autostart <vm> --disable |
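If several VMs turn out to have autostart enabled, the two commands above can be combined. This is a minimal sketch to be run on each infra node; the function name is hypothetical, and it uses only the virsh invocations already shown.

```shell
#!/usr/bin/env bash
# Sketch: disable autostart for every VM that currently has it enabled.
# The function name is hypothetical.
disable_all_autostart() {
  local vm
  virsh list --all --name --autostart | while read -r vm; do
    [ -n "$vm" ] || continue           # skip the blank line that ends the list
    virsh autostart "$vm" --disable </dev/null
  done
}
```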
Step 2 : Turn off VMs/services
1 - Turn off radosgw
Code Block |
---|
for a in $(seq 0 2); do ssh radosgw$a "halt" ; done |
2 - Mass turn off the other VMs. Ignore the warnings about apimon0 and apimon2 not existing.
Code Block |
---|
# Friendly one liner
for a in adminlb apimon cache cinder glance heatapi heatengine horizon keystone munin neutronapi novaapi novaservices puppetdb servicelb; do for b in $(seq 0 2); do ssh $a$b "halt" ; done ; done
# Broken down for readability (same host list as the one-liner; redis is handled in its own step)
for a in adminlb apimon cache cinder glance heatapi heatengine horizon keystone munin neutronapi novaapi novaservices puppetdb servicelb; do
  for b in $(seq 0 2); do
    ssh $a$b "halt"
  done
done |
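The warnings about non-existing hosts can be avoided by filtering the host list first. This is a sketch under the assumption that the VM names resolve through getent (DNS or /etc/hosts); the function name existing_hosts is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print <role>0..2 for each role given, skipping names that do not resolve.
# Assumes host names can be looked up with getent; the function name is hypothetical.
existing_hosts() {
  local role i host
  for role in "$@"; do
    for i in 0 1 2; do
      host="$role$i"
      if getent hosts "$host" >/dev/null; then
        echo "$host"
      fi
    done
  done
}
```

Usage: for h in $(existing_hosts adminlb apimon cache); do ssh "$h" halt; done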
3 - Storage nodes
Verify on a cephmon that there is 0 I/O.
Shut down storage01 to 05.
Code Block |
---|
for a in $(seq 1 5); do ssh storage0$a halt ; done |
4 - postgres
Find the master. Use ip addr show and note the node with two interfaces.
Code Block |
---|
# One-liner to find which node is the master
for a in $(seq 0 2); do [ $(ssh postgres$a ip addr show | grep -c inet) -eq 5 ] && echo postgres$a is master; done |
Note which node is the master
Shut down the non masters
Shut down the master
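Once the master is known, the shutdown order follows mechanically. A minimal sketch, with a hypothetical function name; the master is passed in by hand after it has been identified as described above.

```shell
#!/usr/bin/env bash
# Sketch: print the three postgres nodes in shutdown order, master last.
# The function name is hypothetical; the master name is supplied by the operator.
postgres_shutdown_order() {
  local master=$1 i
  for i in 0 1 2; do
    [ "postgres$i" = "$master" ] || echo "postgres$i"
  done
  echo "$master"
}
```

Usage: for h in $(postgres_shutdown_order postgres1); do ssh "$h" halt; done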
5 - mysql
Shut down one by one.
Code Block |
---|
# for a in $(seq 0 2); do ssh mysql$a halt; sleep 60 ; done
# On each mysql node
systemctl status mysql
systemctl stop mysql
systemctl status mysql
halt |
6 - Kanin
Code Block |
---|
# for a in $(seq 0 2); do ssh kanin$a halt; sleep 15; done
# On each kanin node
systemctl status rabbitmq-server
systemctl stop rabbitmq-server
systemctl status rabbitmq-server
halt |
7 - Redis
Check which redis is master on http://adminapi.stack.it.ntnu.no:9000/ (the one with the green line).
Turn off the others first.
redis-cli command to find which role a redis server has:
Code Block |
---|
# On each redis server
redis-cli -a "<password>" info replication | grep role
# Masters will print "role:master" and slaves will print "role:slave" |
Note which node goes down last, and power it on first.
Code Block |
---|
# Check: http://adminapi.stack.it.ntnu.no:9000/
# On non masters
systemctl status redis
systemctl stop redis
systemctl status redis
halt |
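When scripting the role check, note that the INFO output from redis-cli typically has CRLF line endings, so a plain string comparison against "master" can fail. A small sketch that strips the carriage return; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the role from `redis-cli info replication` output on stdin.
# Carriage returns are stripped so the result compares cleanly.
redis_role() {
  tr -d '\r' | sed -n 's/^role://p'
}
```

Usage: redis-cli -a "<password>" info replication | redis_role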
8 - Shut down cephmon
Code Block |
---|
for a in $(seq 0 2); do ssh cephmon$a halt; done |
9 - Turn off infra nodes
Power on
1 - Power on the infra nodes.
2 - mysql
Create the virsh command. Remember: last one down, first one up.
ssh <mysql node shut down last>
A trick that works is to edit my.cnf on the node you stopped last and remove the addresses of the other nodes in the cluster. It then starts on its own, and you can start the others, which connect to the first one.
/etc/mysql/my.cnf
wsrep_cluster_address = gcomm://10.212.0.53,10.212.0.60,10.212.0.61
→
wsrep_cluster_address = gcomm://
systemctl restart mysqld
Boot the other two.
They should become part of the cluster.
On any of them:
mysql
show status;
| wsrep_cluster_size | 3
It should be 3.
When it is 3: on the first mysql server, restore wsrep_cluster_address and restart mysql.
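The cluster size can also be read without scrolling through the full status output. This sketch assumes the mysql client can log in without extra options on the node (for example as root through the local socket); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the Galera cluster size as seen from the local node.
# Assumes the mysql client needs no extra login options here.
wsrep_cluster_size() {
  mysql -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'" | awk '{print $2}'
}
```

Usage: [ "$(wsrep_cluster_size)" -eq 3 ] && echo "all three nodes have joined"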
3 - postgres
Power on the master.
systemctl status postgres - OK - if necessary, find out how postgres is running.
Power on the others.
4 - Kanin: power on the node that was shut down last. Wait (at this point there will be only 1 node).
rabbitmqctl cluster_status
Cluster status of node rabbit@kanin0 ...
[{nodes,[{disc,[rabbit@kanin0,rabbit@kanin1,rabbit@kanin2]}]},
{running_nodes,[rabbit@kanin2,rabbit@kanin1,rabbit@kanin0]},
{cluster_name,<<"rabbit@kanin2.iaas.ntnu.no">>},
{partitions,[]},
{alarms,[{rabbit@kanin2,[]},{rabbit@kanin1,[]},{rabbit@kanin0,[]}]}]
Power on the other two.
5 - cephmon
boot
ceph -s
6 - Boot storage
Wait until ceph -s shows health OK.
Lars Erik breaks the numbering here:
Power on one adminlb, one servicelb and at least one puppetdb before the rest.
7 - The rest.
Use openstack commands to check that things work (tm).
8 - Restart OpenStack networking on the infra nodes if the network is not OK.
bjarneskpc:~$ openstack network agent list --sort-column Host
+--------------------------------------+----------------------+-----------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+----------------------+-----------+-------------------+-------+-------+---------------------------+
| 10b839c3-9fb8-4fef-aba0-0eeae23add42 | Open vSwitch agent | compute01 | None | :-) | UP | neutron-openvswitch-agent |
| 8263f2a2-9eba-46f2-9cdf-daeb05124918 | Open vSwitch agent | compute02 | None | :-) | UP | neutron-openvswitch-agent |
| ecd77d21-6839-4d5a-abaa-10e5dcd64afd | Open vSwitch agent | compute03 | None | :-) | UP | neutron-openvswitch-agent |
| e6461a14-f278-4ee1-9575-fd3cfc208604 | Open vSwitch agent | compute04 | None | :-) | UP | neutron-openvswitch-agent |
| 4e040257-af95-401d-b940-44968ba053ba | Open vSwitch agent | compute05 | None | :-) | UP | neutron-openvswitch-agent |
| a6083a53-06a5-4588-a52b-a5cebb94be5b | Open vSwitch agent | compute06 | None | :-) | UP | neutron-openvswitch-agent |
| 68356d55-8562-48e1-a4c7-38c432dd3fc8 | Open vSwitch agent | compute08 | None | :-) | UP | neutron-openvswitch-agent |
| 9e011ede-c3b1-4aed-b6ba-160f67be1f61 | Open vSwitch agent | compute09 | None | :-) | UP | neutron-openvswitch-agent |
| 92535f51-a764-48af-889b-381a8ca77222 | DHCP agent | infra00 | nova | :-) | UP | neutron-dhcp-agent |
| 9872f4a7-1066-4862-9cb8-51a5f3add6b5 | Loadbalancerv2 agent | infra00 | None | :-) | UP | neutron-lbaasv2-agent |
| a3aa759e-7fbc-43fa-b9eb-47d85a23981e | Metadata agent | infra00 | None | :-) | UP | neutron-metadata-agent |
| d597e148-12f5-4bdd-bd05-db5b936be393 | Open vSwitch agent | infra00 | None | :-) | UP | neutron-openvswitch-agent |
| d728e0c2-2f72-4e39-ab2b-431f33efd0c5 | L3 agent | infra00 | nova | :-) | UP | neutron-l3-agent |
| 4be71b64-c102-4e86-892d-76f0b3d43881 | Loadbalancerv2 agent | infra01 | None | :-) | UP | neutron-lbaasv2-agent |
| 6b07607e-bf5d-4317-9c58-300e7af2c2ea | Open vSwitch agent | infra01 | None | :-) | UP | neutron-openvswitch-agent |
| 849f7738-f5d4-4e31-a7bf-7f94fca29cc2 | L3 agent | infra01 | nova | :-) | UP | neutron-l3-agent |
| 998d3169-15fe-49f7-b31a-79f787851680 | Metadata agent | infra01 | None | :-) | UP | neutron-metadata-agent |
| f73bb906-986d-4e28-8fdf-982aae1a1790 | DHCP agent | infra01 | nova | :-) | UP | neutron-dhcp-agent |
| 25aea9b5-0f0b-47dc-8d7d-3dcef9d67d50 | Metadata agent | infra02 | None | :-) | UP | neutron-metadata-agent |
| 9f487c51-1599-4924-942b-a0f45905a84c | L3 agent | infra02 | nova | :-) | UP | neutron-l3-agent |
| b03b1721-6073-4869-b2c7-e98afeab4c47 | Open vSwitch agent | infra02 | None | :-) | UP | neutron-openvswitch-agent |
| cf42e771-dbb4-4439-88c7-2fb02ea5613d | Loadbalancerv2 agent | infra02 | None | :-) | UP | neutron-lbaasv2-agent |
| dcf7bf04-5634-4ad4-ba90-5c75254b80f5 | DHCP agent | infra02 | nova | :-) | UP | neutron-dhcp-agent |
+--------------------------------------+----------------------+-----------+-------------------+-------+-------+---------------------------+
If a restart is needed, use systemctl restart:
systemctl restart neutron-dhcp-agent.service neutron-lbaasv2-agent.service neutron-openvswitch-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-ovs-cleanup.service
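To decide whether a restart is needed, the agent listing can be filtered for agents that are not alive. This is a sketch that parses the table format shown above and assumes the client prints ":-)" in the Alive column for live agents, as in that listing; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: read the table printed by `openstack network agent list` on stdin and
# print "host: agent type" for every agent whose Alive column is not ":-)".
agents_not_alive() {
  awk -F'|' '
    NF >= 8 && $2 !~ /^ *ID *$/ && $6 !~ /:-\)/ {
      gsub(/^ +| +$/, "", $3)   # trim the agent type
      gsub(/^ +| +$/, "", $4)   # trim the host name
      print $4 ": " $3
    }'
}
```

Usage: openstack network agent list --sort-column Host | agents_not_alive. No output means that every agent is reported alive.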
10 - Boot the compute nodes. Mind the reverse order (quorum for things in the OpenStack).
11 - openstack compute service list
UP is good
bjarneskpc:~$ openstack compute service list
+-----+------------------+---------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+-----+------------------+---------------+----------+---------+-------+----------------------------+
| 30 | nova-compute | compute01 | nova | enabled | up | 2019-06-19T12:44:39.000000 |
| 33 | nova-compute | compute02 | nova | enabled | up | 2019-06-19T12:44:38.000000 |
| 36 | nova-compute | compute03 | nova | enabled | up | 2019-06-19T12:44:37.000000 |
| 212 | nova-compute | compute04 | nova | enabled | up | 2019-06-19T12:44:36.000000 |
| 215 | nova-compute | compute05 | nova | enabled | up | 2019-06-19T12:44:37.000000 |
| 224 | nova-compute | compute06 | nova | enabled | up | 2019-06-19T12:44:39.000000 |
| 233 | nova-conductor | novaservices0 | internal | enabled | up | 2019-06-19T12:44:31.000000 |
| 239 | nova-scheduler | novaservices0 | internal | enabled | up | 2019-06-19T12:44:40.000000 |
| 248 | nova-consoleauth | novaservices0 | internal | enabled | up | 2019-06-19T12:44:34.000000 |
| 255 | nova-consoleauth | novaservices1 | internal | enabled | up | 2019-06-19T12:44:32.000000 |
| 258 | nova-scheduler | novaservices1 | internal | enabled | up | 2019-06-19T12:44:40.000000 |
| 261 | nova-conductor | novaservices1 | internal | enabled | up | 2019-06-19T12:44:32.000000 |
| 264 | nova-scheduler | novaservices2 | internal | enabled | up | 2019-06-19T12:44:32.000000 |
| 267 | nova-consoleauth | novaservices2 | internal | enabled | up | 2019-06-19T12:44:37.000000 |
| 270 | nova-conductor | novaservices2 | internal | enabled | up | 2019-06-19T12:44:35.000000 |
| 275 | nova-compute | compute08 | nova | enabled | up | 2019-06-19T12:44:40.000000 |
| 281 | nova-compute | compute09 | nova | enabled | up | 2019-06-19T12:44:38.000000 |
+-----+------------------+---------------+----------+---------+-------+----------------------------+
12 - Run the test script.
13 - Call Gjøvik.