Friday, May 31, 2013

How to fix your PRM cluster when upgrading to RHEL/CentOS 6.4

http://www.mysqlperformanceblog.com/2013/05/29/how-to-fix-your-prm-cluster-when-upgrading-to-rhelcentos-6-4/

Wednesday, May 29, 2013 11:30 AM | MySQL Performance Blog | Frederic Descamps

If you are using Percona Replication Manager (PRM) with RHEL/CentOS prior to 6.4, upgrading your distribution to 6.4 may break your cluster. In this post I will explain how to fix your cluster if it breaks after a distribution upgrade that brings an update of Pacemaker from 1.1.7 to 1.1.8. You can also follow the official documentation here.

The version of Pacemaker provided with 6.4 (still considered a Technology Preview by Red Hat) is 1.1.8-x, which is not 100% compatible with 1.1.7-x; see this report.

So if you want to upgrade, you cannot apply a rolling upgrade process. As with the Pacemaker 0.6.x to 1.0.x upgrade, you need to update all nodes at once. As noted in RHBA-2013-0375, Red Hat encourages people to use Pacemaker in combination with CMAN (it may become mandatory with the next release).

CMAN v3 is a Corosync plugin that monitors the names and number of active cluster nodes in order to deliver membership and quorum information to clients (such as the Pacemaker daemons), and it is part of the Red Hat cluster stack. If you were using the Puppet recipes published previously here, you are not yet using CMAN.
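A quick way to check which membership layer a node is currently using is to look at the "Stack:" line reported by crm_mon (a minimal check, assuming crm_mon is in the PATH). If it reads "openais", Pacemaker still runs as a Corosync plugin; after the migration described in this post it will read "cman".

[root@percona1 percona]# crm_mon -1 | grep -i stack
Stack: openais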

Let's have a look at what happens with a cluster of 3 nodes (CentOS 6.3) using PRM as an OCF resource agent:

[root@percona1 percona]# crm_mon -1
============
Last updated: Thu May 23 08:04:30 2013
Last change: Thu May 23 08:03:41 2013 via crm_attribute on percona2
Stack: openais
Current DC: percona1 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, 3 expected votes
7 Resources configured.
============

Online: [ percona1 percona2 percona3 ]

reader_vip_1 (ocf::heartbeat:IPaddr2): Started percona3
reader_vip_2 (ocf::heartbeat:IPaddr2): Started percona2
reader_vip_3 (ocf::heartbeat:IPaddr2): Started percona1
writer_vip (ocf::heartbeat:IPaddr2): Started percona1
Master/Slave Set: ms_MySQL [p_mysql]
Masters: [ percona2 ]
Slaves: [ percona3 percona1 ]

[root@percona1 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)
[root@percona1 ~]# rpm -q pacemaker
pacemaker-1.1.7-6.el6.x86_64
[root@percona1 ~]# rpm -q corosync
corosync-1.4.1-7.el6_3.1.x86_64

Everything is working.
Let's update our system to 6.4 on one server…

NOTE: In production you should put the cluster in maintenance mode before the update; see below how to perform this action.

[root@percona1 percona]# yum update -y

[root@percona1 percona]# cat /etc/redhat-release
CentOS release 6.4 (Final)

[root@percona1 ~]# rpm -q pacemaker
pacemaker-1.1.8-7.el6.x86_64
[root@percona1 ~]# rpm -q corosync
corosync-1.4.1-15.el6_4.1.x86_64

Let's reboot it…

[root@percona1 percona]# reboot

If we check the cluster from another node, we see that percona1 is now offline:

============
Last updated: Thu May 23 08:29:36 2013
Last change: Thu May 23 08:03:41 2013 via crm_attribute on percona2
Stack: openais
Current DC: percona3 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, 3 expected votes
7 Resources configured.
============

Online: [ percona2 percona3 ]
OFFLINE: [ percona1 ]

reader_vip_1 (ocf::heartbeat:IPaddr2): Started percona2
reader_vip_2 (ocf::heartbeat:IPaddr2): Started percona3
reader_vip_3 (ocf::heartbeat:IPaddr2): Started percona2
writer_vip (ocf::heartbeat:IPaddr2): Started percona3
Master/Slave Set: ms_MySQL [p_mysql]
Masters: [ percona2 ]
Slaves: [ percona3 ]
Stopped: [ p_mysql:2 ]

After the update, and after fixing some small issues like the one below, you are able to start Corosync and Pacemaker, but the node doesn't join the cluster.

May 23 08:34:12 percona1 corosync[1535]: [MAIN ] parse error in config: Can't open logfile '/var/log/corosync.log' for reason: Permission denied (13).#012.
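One way to work around this particular error, assuming corosync.conf still points its logfile at /var/log/corosync.log, is to recreate the file with permissions and an SELinux context that corosync can write to. A sketch:

# recreate the logfile and restore its default ownership/permissions/label (assumed defaults)
[root@percona1 ~]# touch /var/log/corosync.log
[root@percona1 ~]# chown root:root /var/log/corosync.log
[root@percona1 ~]# chmod 640 /var/log/corosync.log
[root@percona1 ~]# restorecon -v /var/log/corosync.log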

So now you need to update all nodes to Pacemaker 1.1.8, but to avoid running into issues again with the next distribution update, I prefer to use CMAN as recommended.

First, as we have 2 of our 3 nodes running, we should try not to stop all our servers… let's put the cluster in maintenance mode (don't forget: you should have done this even before updating the first node, but I wanted to simulate the problem):

[root@percona3 percona]# crm configure property maintenance-mode=true

We can see that the resources are unmanaged:

============
Last updated: Thu May 23 08:43:49 2013
Last change: Thu May 23 08:43:49 2013 via cibadmin on percona3
Stack: openais
Current DC: percona3 - partition with quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, 3 expected votes
7 Resources configured.
============

Online: [ percona2 percona3 ]
OFFLINE: [ percona1 ]

reader_vip_1 (ocf::heartbeat:IPaddr2): Started percona2 (unmanaged)
reader_vip_2 (ocf::heartbeat:IPaddr2): Started percona3 (unmanaged)
reader_vip_3 (ocf::heartbeat:IPaddr2): Started percona2 (unmanaged)
writer_vip (ocf::heartbeat:IPaddr2): Started percona3 (unmanaged)
Master/Slave Set: ms_MySQL [p_mysql] (unmanaged)
p_mysql:0 (ocf::percona:mysql): Master percona2 (unmanaged)
p_mysql:1 (ocf::percona:mysql): Started percona3 (unmanaged)
Stopped: [ p_mysql:2 ]

Now we can upgrade all servers to 6.4:

[root@percona2 percona]# yum -y update
[root@percona3 percona]# yum -y update

Meanwhile, we can already prepare the first node to use CMAN:

[root@percona1 ~]# yum -y install cman ccs

Back on the two nodes that were updating: they are now on 6.4:

[root@percona3 percona]# cat /etc/redhat-release
CentOS release 6.4 (Final)

And let's check the cluster status:

[root@percona3 percona]# crm_mon -1
Could not establish cib_ro connection: Connection refused (111)

Connection to cluster failed: Transport endpoint is not connected…

…but MySQL is still running:

[root@percona2 percona]# mysqladmin ping
mysqld is alive

[root@percona3 percona]# mysqladmin ping
mysqld is alive

Let's install CMAN on percona2 and percona3 too:

[root@percona2 percona]# yum -y install cman ccs
[root@percona3 percona]# yum -y install cman ccs

Then, on ALL nodes, stop Pacemaker and Corosync:

[root@percona1 ~]# /etc/init.d/pacemaker stop
[root@percona1 ~]# /etc/init.d/corosync stop
[root@percona2 ~]# /etc/init.d/pacemaker stop
[root@percona2 ~]# /etc/init.d/corosync stop
[root@percona3 ~]# /etc/init.d/pacemaker stop
[root@percona3 ~]# /etc/init.d/corosync stop

Remove Corosync from the startup services:

[root@percona1 ~]# chkconfig corosync off
[root@percona2 ~]# chkconfig corosync off
[root@percona3 ~]# chkconfig corosync off

Let's specify that the cluster can start without quorum:

[root@percona1 ~]# sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman
[root@percona2 ~]# sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman
[root@percona3 ~]# sed -i.sed "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman
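You can quickly verify the change took effect (expected output shown, assuming the file contains a CMAN_QUORUM_TIMEOUT line for the sed above to rewrite):

[root@percona1 ~]# grep CMAN_QUORUM_TIMEOUT /etc/sysconfig/cman
CMAN_QUORUM_TIMEOUT=0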

And create the cluster; perform the following command on one server only:

[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --createcluster lefred_prm

Now add the nodes to the cluster:

[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addnode percona1
Node percona1 added.
[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addnode percona2
Node percona2 added.
[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addnode percona3
Node percona3 added.

We then need to delegate fencing to Pacemaker (adding a fence device, a fence method for each node, and the fence instances):

[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk

[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect percona1
Method pcmk-redirect added to percona1.
[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect percona2
Method pcmk-redirect added to percona2.
[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect percona3
Method pcmk-redirect added to percona3.

[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk percona1 pcmk-redirect port=percona1
[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk percona2 pcmk-redirect port=percona2
[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk percona3 pcmk-redirect port=percona3

Encrypt the cluster communication (reusing the Corosync authkey) and use the udpu unicast transport:

[root@percona1 ~]# ccs -f /etc/cluster/cluster.conf --setcman keyfile="/etc/corosync/authkey" transport="udpu"

Let's check if the configuration file is OK:

[root@percona1 ~]# ccs_config_validate -f /etc/cluster/cluster.conf
Configuration validates
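For reference, the cluster.conf generated by the ccs commands above should look roughly like this (a sketch reconstructed from those commands; the config_version and attribute ordering may differ on your system):

<?xml version="1.0"?>
<cluster config_version="9" name="lefred_prm">
  <!-- CMAN reuses the Corosync authkey for encryption and uses UDP unicast -->
  <cman keyfile="/etc/corosync/authkey" transport="udpu"/>
  <clusternodes>
    <clusternode name="percona1" nodeid="1">
      <fence>
        <!-- fencing is redirected to Pacemaker via fence_pcmk -->
        <method name="pcmk-redirect">
          <device name="pcmk" port="percona1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="percona2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="percona2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="percona3" nodeid="3">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="percona3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
</cluster>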

We can now copy the configuration file to all nodes:

[root@percona1 ~]# scp /etc/cluster/cluster.conf percona2:/etc/cluster/
[root@percona1 ~]# scp /etc/cluster/cluster.conf percona3:/etc/cluster/

Enable CMAN at startup on all nodes:

[root@percona1 ~]# chkconfig cman on
[root@percona2 ~]# chkconfig cman on
[root@percona3 ~]# chkconfig cman on

And start the services on all nodes:

[root@percona1 ~]# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot… [ OK ]
Checking Network Manager… [ OK ]
Global setup… [ OK ]
Loading kernel modules… [ OK ]
Mounting configfs… [ OK ]
Starting cman… [ OK ]
Waiting for quorum… [ OK ]
Starting fenced… [ OK ]
Starting dlm_controld… [ OK ]
Tuning DLM kernel config… [ OK ]
Starting gfs_controld… [ OK ]
Unfencing self… [ OK ]
Joining fence domain… [ OK ]
[root@percona1 ~]# /etc/init.d/pacemaker start
Starting Pacemaker Cluster Manager: [ OK ]

[root@percona2 ~]# /etc/init.d/cman start
[root@percona2 ~]# /etc/init.d/pacemaker start
[root@percona3 ~]# /etc/init.d/cman start
[root@percona3 ~]# /etc/init.d/pacemaker start

We can now connect crm_mon to the cluster and check its status:

[root@percona2 percona]# crm_mon -1
Last updated: Thu May 23 09:18:58 2013
Last change: Thu May 23 09:16:31 2013 via crm_attribute on percona1
Stack: cman
Current DC: percona1 - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, 3 expected votes
7 Resources configured.

Online: [ percona1 percona2 percona3 ]

reader_vip_1 (ocf::heartbeat:IPaddr2): Started percona3
reader_vip_2 (ocf::heartbeat:IPaddr2): Started percona2
reader_vip_3 (ocf::heartbeat:IPaddr2): Started percona1
writer_vip (ocf::heartbeat:IPaddr2): Started percona1
Master/Slave Set: ms_MySQL [p_mysql]
Masters: [ percona1 ]
Slaves: [ percona2 percona3 ]

We can see that some resources moved; this is because we didn't put the cluster in maintenance mode on node1 before the update to 6.4.

If we had put everything in maintenance mode before the upgrade to 6.4, as we should have, it would now be time to disable it… but the crm command is not present any more.

It's still possible to get the command back by installing crmsh (the crm shell, from another repository), or you can just install pcs (Pacemaker Configuration System):

[root@percona2 percona]# yum -y install pcs
[root@percona2 percona]# pcs status
Last updated: Thu May 23 09:24:37 2013
Last change: Thu May 23 09:16:31 2013 via crm_attribute on percona1
Stack: cman
Current DC: percona1 - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, 3 expected votes
7 Resources configured.

Online: [ percona1 percona2 percona3 ]

Full list of resources:

reader_vip_1 (ocf::heartbeat:IPaddr2): Started percona3
reader_vip_2 (ocf::heartbeat:IPaddr2): Started percona2
reader_vip_3 (ocf::heartbeat:IPaddr2): Started percona1
writer_vip (ocf::heartbeat:IPaddr2): Started percona1
Master/Slave Set: ms_MySQL [p_mysql]
Masters: [ percona1 ]
Slaves: [ percona2 percona3 ]

So if you were in maintenance mode, you should see:

[root@percona2 percona]# pcs status
Last updated: Thu May 23 09:26:56 2013
Last change: Thu May 23 09:26:50 2013 via cibadmin on percona2
Stack: cman
Current DC: percona1 - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, 3 expected votes
7 Resources configured.

Online: [ percona1 percona2 percona3 ]

Full list of resources:

reader_vip_1 (ocf::heartbeat:IPaddr2): Started percona3 (unmanaged)
reader_vip_2 (ocf::heartbeat:IPaddr2): Started percona2 (unmanaged)
reader_vip_3 (ocf::heartbeat:IPaddr2): Started percona1 (unmanaged)
writer_vip (ocf::heartbeat:IPaddr2): Started percona1 (unmanaged)
Master/Slave Set: ms_MySQL [p_mysql] (unmanaged)
p_mysql:0 (ocf::percona:mysql): Master percona1 (unmanaged)
p_mysql:1 (ocf::percona:mysql): Slave percona2 (unmanaged)
p_mysql:2 (ocf::percona:mysql): Slave percona3 (unmanaged)

And now you are able to stop maintenance mode:

[root@percona2 percona]# pcs property set maintenance-mode=false
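To double-check that resources are managed again, you can list the cluster properties (a sketch; the exact subcommand and full property list depend on your pcs version, and the output is abridged here to the relevant property), or simply re-run pcs status and confirm that the "(unmanaged)" markers are gone:

[root@percona2 percona]# pcs property list
Cluster Properties:
 maintenance-mode: false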

You can also check your cluster using cman_tool or clustat (if you have rgmanager installed):

[root@percona3 ~]# cman_tool nodes
Node  Sts   Inc   Joined                Name
   1   M     64   2013-05-23 09:52:03   percona1
   2   M     64   2013-05-23 09:52:03   percona2
   3   M     64   2013-05-23 09:52:03   percona3

[root@percona3 ~]# clustat
Cluster Status for lefred_prm @ Thu May 23 10:20:36 2013
Member Status: Quorate

 Member Name       ID   Status
 ------ ----       ---- ------
 percona1          1    Online
 percona2          2    Online
 percona3          3    Online, Local

Now the cluster is fixed, everything works again as expected, and you should be ready for the next distro upgrade!

INFO: If you have the file /etc/corosync/service.d/pcmk, you need to delete it before installing CMAN.
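For example, on each node that still has the old Pacemaker plugin definition (this simply removes the legacy Corosync service file so it does not conflict with CMAN):

[root@percona1 ~]# rm -f /etc/corosync/service.d/pcmk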
