技術メモメモ: Clustering MariaDB with CentOS 7 + Pacemaker + Corosync ③ (Failure Testing)

This is the last article in the series on building a cluster with Pacemaker and Corosync.

★Previous articles in this series↓

Clustering MariaDB with CentOS 7 + Pacemaker + Corosync ① (Preparation and Installation)
https://tech-mmmm.blogspot.com/2018/11/centos-7-pacemaker-corosyncmariadb.html

Clustering MariaDB with CentOS 7 + Pacemaker + Corosync ② (Cluster and Resource Configuration)
https://tech-mmmm.blogspot.com/2018/11/centos-7-pacemaker-corosyncmariadb_65.html

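In this third and final part, we inject simulated failures into the cluster configured in the previous parts and observe how the resources fail over.

Many failure patterns are possible. This article covers the manual failover procedures and then, as representative failures, a server failure and a NIC failure.

Manual failover ① (moving the resource group with a command)

Rebooting a server just to trigger a failover by hand is a hassle, so we fail the resource group over with a command instead. First, check where the resource group is currently running.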

# pcs resource show
------------------------------
Resource Group: rg01
VirtualIP (ocf::heartbeat:IPaddr2): Started t1113cent
ShareDir (ocf::heartbeat:Filesystem): Started t1113cent
MariaDB (systemd:mariadb): Starting t1113cent
------------------------------

The following command moves the resource group rg01 to node t1114cent.

# pcs resource move rg01 t1114cent

Checking again confirms that the resource group has indeed moved to t1114cent.

# pcs resource show
------------------------------
Resource Group: rg01
VirtualIP (ocf::heartbeat:IPaddr2): Started t1114cent
ShareDir (ocf::heartbeat:Filesystem): Started t1114cent
MariaDB (systemd:mariadb): Starting t1114cent
------------------------------
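When the same check has to be repeated across several tests, it can be scripted instead of read by eye. The following is a minimal sketch, assuming the resource lines look exactly like the output above. The function names resource_states and all_started_on are made up for this example and are not part of pcs.

```shell
# Print "<resource> <state> <node>" for each resource line read from stdin.
# Expects lines shaped like the pcs output in this article, e.g.
#   VirtualIP (ocf::heartbeat:IPaddr2): Started t1114cent
# The node is printed as "-" when the line has no node (e.g. a stopped resource).
resource_states() {
    awk '$2 ~ /^\(.*\):$/ && NF >= 3 { print $1, $3, (NF >= 4 ? $4 : "-") }'
}

# Succeed only if at least one resource is listed and every listed
# resource is in state "Started" on the given node.
all_started_on() {
    node=$1
    resource_states | awk -v node="$node" '
        { seen = 1; if ($2 != "Started" || $3 != node) bad = 1 }
        END { exit !(seen && !bad) }'
}
```

For example, "pcs resource show | all_started_on t1114cent" succeeds only once every resource reports Started on t1114cent. With the output above it would still fail, because MariaDB is shown as Starting.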

To fail the resource group back, move it again to the node it was originally running on, as follows.

# pcs resource move rg01 t1113cent

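Manual failover ② (putting a node into standby)

Besides failing over with the move command, you can also force the resource group to move by putting a node into standby, so that procedure is shown here as well. First, check the current cluster state.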

# pcs status
------------------------------
Cluster name: cluster2
Stack: corosync
Current DC: t1114cent (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Nov 24 15:00:47 2018
Last change: Sat Nov 24 15:00:44 2018 by root via cibadmin on t1114cent

2 nodes configured
3 resources configured

Online: [ t1113cent t1114cent ]

Full list of resources:

Resource Group: rg01
VirtualIP (ocf::heartbeat:IPaddr2): Started t1113cent
ShareDir (ocf::heartbeat:Filesystem): Started t1113cent
MariaDB (systemd:mariadb): Started t1113cent

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
------------------------------

Put t1113cent, the node currently running the resource group, into standby.

# pcs cluster standby t1113cent

Check the cluster state again.

# pcs status
------------------------------
Cluster name: cluster2
Stack: corosync
Current DC: t1114cent (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sat Nov 24 15:03:50 2018
Last change: Sat Nov 24 15:03:42 2018 by root via cibadmin on t1113cent

2 nodes configured
3 resources configured

Node t1113cent: standby   ←★now in standby
Online: [ t1114cent ]

Full list of resources:

Resource Group: rg01   ←★all resources have moved to t1114cent
VirtualIP (ocf::heartbeat:IPaddr2): Started t1114cent
ShareDir (ocf::heartbeat:Filesystem): Started t1114cent
MariaDB (systemd:mariadb): Starting t1114cent

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
------------------------------

To fail the resource group back, proceed as follows.

# pcs cluster unstandby t1113cent  ←★take t1113cent out of standby

# pcs cluster standby t1114cent  ←★put t1114cent into standby

# pcs cluster unstandby t1114cent  ←★take t1114cent out of standby

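Server down failure

This test verifies how the cluster behaves when a server stops or reboots unexpectedly.

Restart the virtual machine (t1113cent, the active node in this test) from VMware Host Client.

The output below confirms that the resource group switched over normally.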

# pcs status
------------------------------
Cluster name: cluster2
Stack: corosync
Current DC: t1114cent (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sun Nov 25 06:33:04 2018
Last change: Sun Nov 25 06:25:18 2018 by root via crm_resource on t1113cent

2 nodes configured
3 resources configured

Online: [ t1114cent ]
OFFLINE: [ t1113cent ]   ←★goes OFFLINE

Full list of resources:

Resource Group: rg01
VirtualIP (ocf::heartbeat:IPaddr2): Started t1114cent
ShareDir (ocf::heartbeat:Filesystem): Started t1114cent
MariaDB (systemd:mariadb): Started t1114cent
↑★switched over to t1114cent

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
------------------------------
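Node states can be pulled out of pcs status in a script as well, which is handy when repeating the standby and server down tests. The following is a minimal sketch, assuming the "Online:", "OFFLINE:" and "Node ...: standby" line formats shown in this article. node_states is a hypothetical helper name, not a pcs feature.

```shell
# Print "<node> <state>" for every node mentioned in pcs status output on stdin.
# state is one of: online, offline, standby.
node_states() {
    awk '
        # Lines such as "Online: [ a b ]" or "OFFLINE: [ a ]"
        ($1 == "Online:" || $1 == "OFFLINE:") && $2 == "[" {
            state = ($1 == "Online:") ? "online" : "offline"
            # Fields 3 .. NF-1 are node names; the last field is the closing bracket.
            for (i = 3; i < NF; i++) print $i, state
        }
        # Lines such as "Node a: standby" or "Node a: standby (on-fail)"
        $1 == "Node" && $3 == "standby" {
            name = $2
            sub(/:$/, "", name)
            print name, "standby"
        }'
}
```

For example, "pcs status | node_states" run against the output above would list t1114cent as online and t1113cent as offline.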

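NIC failure

This test verifies how the cluster behaves when the service NIC is disconnected and the VIP disappears.

The details are outside the scope of this article, but unless the on-fail=standby option is specified for the VIP monitor operation, the failover does not work properly, so we set it beforehand.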

# pcs resource update VirtualIP op monitor on-fail=standby
# pcs resource show --full
------------------------------
Group: rg01
Resource: VirtualIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=192.168.11.115 nic=ens192
Operations: monitor interval=60s on-fail=standby (VirtualIP-monitor-interval-60s)
start interval=0s timeout=20s (VirtualIP-start-interval-0s)
stop interval=0s timeout=20s (VirtualIP-stop-interval-0s)
Resource: ShareDir (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/sharevg/lv001 directory=/share fstype=xfs
Operations: monitor interval=20 timeout=40 (ShareDir-monitor-interval-20)
notify interval=0s timeout=60 (ShareDir-notify-interval-0s)
start interval=0s timeout=60 (ShareDir-start-interval-0s)
stop interval=0s timeout=60 (ShareDir-stop-interval-0s)
Resource: MariaDB (class=systemd type=mariadb)
Operations: monitor interval=60 timeout=100 (MariaDB-monitor-interval-60)
start interval=0s timeout=100 (MariaDB-start-interval-0s)
stop interval=0s timeout=100 (MariaDB-stop-interval-0s)
------------------------------
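Since the monitor interval determines how soon a failure is noticed and on-fail determines what happens next, it can help to list only those two settings per resource. The following is a minimal sketch, assuming the "pcs resource show --full" format shown above. monitor_ops is a hypothetical helper name introduced for this example.

```shell
# Print "<operation id> <interval> <on-fail>" for each monitor operation
# found in "pcs resource show --full" output on stdin.
# on-fail is printed as "-" when it is not set explicitly.
monitor_ops() {
    awk '
        {
            is_monitor = 0; interval = "-"; onfail = "-"
            for (i = 1; i <= NF; i++) {
                if ($i == "monitor") is_monitor = 1
                else if ($i ~ /^interval=/) interval = substr($i, 10)
                else if ($i ~ /^on-fail=/) onfail = substr($i, 9)
            }
            if (is_monitor) {
                # The operation id is the last field, wrapped in parentheses.
                id = $NF
                gsub(/[()]/, "", id)
                print id, interval, onfail
            }
        }'
}
```

For example, "pcs resource show --full | monitor_ops" run against the output above would show that only the VirtualIP monitor has on-fail=standby set.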

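To simulate the failure, disconnect the NIC of the virtual machine.

After about a minute, which is consistent with the 60-second monitor interval, the VirtualIP resource goes into FAILED status and the resource group fails over.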

# pcs status
------------------------------
Cluster name: cluster2
Stack: corosync
Current DC: t1114cent (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sun Nov 25 06:03:12 2018
Last change: Sat Nov 24 15:16:21 2018 by root via crm_resource on t1113cent

2 nodes configured
3 resources configured

Node t1113cent: standby (on-fail)   ←★becomes standby (on-fail)
Online: [ t1114cent ]

Full list of resources:

Resource Group: rg01
VirtualIP (ocf::heartbeat:IPaddr2): FAILED t1113cent  ←★becomes FAILED
ShareDir (ocf::heartbeat:Filesystem): Started t1113cent
MariaDB (systemd:mariadb): Stopping t1113cent

Failed Actions:
* VirtualIP_monitor_60000 on t1113cent 'not running' (7): call=51, status=complete, exitreason='',
last-rc-change='Sun Nov 25 06:03:09 2018', queued=0ms, exec=0ms
↑★a log entry for VirtualIP is shown

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
------------------------------

Once the failed node has recovered, run the following command to return the cluster to its normal state.

# pcs resource cleanup
------------------------------
Cleaned up all resources on all nodes
Waiting for 1 replies from the CRMd. OK
------------------------------

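Note that recovering the cluster with the command above causes the resource group to fail back to the original node. Apparently the score of the node where the group originally ran is still INFINITY, which triggers the failback. A likely source is the earlier pcs resource move rg01 t1113cent, since pcs resource move works by adding an INFINITY location constraint for the destination node. To keep this behavior under proper control, it may be better to enable STONITH so that a failed node is forcibly stopped.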
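Whether a leftover location constraint is behind the failback can be checked with pcs constraint --full, and pcs resource clear removes the constraints that pcs resource move created. The following is a minimal sketch of that check, not something verified in this article. The names run and clear_move_constraint and the DRY_RUN variable are made up for this example. Whether the group then stays on its current node depends on settings such as resource-stickiness, which this article does not cover.

```shell
# Run a command, or only print it when DRY_RUN=1
# (useful for reviewing the steps before touching the cluster).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show all constraints, then remove the location constraints that
# "pcs resource move" created for the given resource or group.
# Intended to be run after the move itself has completed.
clear_move_constraint() {
    target=$1
    run pcs constraint --full || return 1
    run pcs resource clear "$target"
}
```

For example, running "DRY_RUN=1" followed by "clear_move_constraint rg01" only prints the two pcs commands. Without DRY_RUN they are executed against the cluster.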

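Summary

This concludes the three-part series on the Pacemaker + Corosync setup procedure and behavior verification.
The setup involves many steps and is fairly complex, but everything can be configured through commands. Because there is no need to write complicated configuration files, I found it reasonably easy to configure once you get used to it.
STONITH was not configured this time, but it is recommended as a countermeasure against split-brain, so I would like to verify and write up the STONITH setup procedure separately.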
