Resources restart when reverting to the original configuration after a failover (Linux-ha-jp) - Linux-HA Japan

Nemoto-san,

Hello, this is Yamauchi.

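> A server in a redundant DRBD + pacemaker configuration failed, and
> I have a question about restoring the original layout after the failover.
> 
> [Environment]
> OS : CentOS 5.6
> * Strictly speaking, it runs an OpenVZ kernel.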
> # uname -a
> Linux 03a.ss.jp 2.6.18-194.26.1.el5.028stab079.2 #1 SMP Fri Jan 28 20:28:13 JST 2011 x86_64 x86_64 x86_64 GNU/Linux)
> 
> Configuration: sister configuration (two nodes)
> pacemaker-1.0.10-1.4.el5 (provided by Linux-HA Japan)
> 
> # crm_mon -1
> ============
> Last updated: Tue Jun 28 11:37:59 2011
> Stack: Heartbeat
> Current DC: 03a.ss.jp (b6f00617-89f3-41bd-9a94-6c492648b173) - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 2 Nodes configured, unknown expected votes
> 4 Resources configured.
> ============
> 
> Online: [ 03a.ss.jp ]
> OFFLINE: [ 03b.ss.jp ]
> 
>  Resource Group: group1
>      res_Filesystem1    (ocf::heartbeat:Filesystem):    Started 03a.ss.jp
>      res_ip1    (ocf::heartbeat:IPaddr2):       Started 03a.ss.jp
>      res_vps1   (ocf::heartbeat:vps):   Started 03a.ss.jp
>      res_MailTo1        (ocf::heartbeat:MailTo):        Started 03a.ss.jp
>  Resource Group: group2
>      res_Filesystem2    (ocf::heartbeat:Filesystem):    Started 03a.ss.jp
>      res_ip2    (ocf::heartbeat:IPaddr2):       Started 03a.ss.jp
>      res_vps2   (ocf::heartbeat:vps):   Started 03a.ss.jp
>      res_MailTo2        (ocf::heartbeat:MailTo):        Started 03a.ss.jp
>  Master/Slave Set: ms_drbd1
>      Masters: [ 03a.ss.jp ]
>      Stopped: [ res_drbd1:1 ]
>  Master/Slave Set: ms_drbd2
>      Masters: [ 03a.ss.jp ]
>      Stopped: [ res_drbd2:1 ]
> 
> 
> 
> Starting from the state above, I tested the scenario in which 03b.ss.jp has been recovered.
> The recovery procedure was:
> 
> 1. Start DRBD on 03b.ss.jp.
> # /etc/rc.d/init.d/drbd start
> 
> 2. Leave it alone until synchronization is complete.
> # cat /proc/drbd    (checked from the 03b.ss.jp side)
> version: 8.3.4 (api:88/proto:86-91)
> GIT-hash: 70a645ae080411c87b4482a135847d69dc90a6a2 build by xemul****@ovzco*****, 2009-10-12 19:29:01
> 
>  1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
>     ns:2020576 nr:803960672 dw:807577716 dr:2657089 al:4845 bm:28439 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
>     ns:79897088 nr:445831256 dw:525754108 dr:2150881 al:6739 bm:24658 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 
> 3. After confirming the state above, start heartbeat on 03b.ss.jp.
> # /etc/rc.d/init.d/heartbeat start
> 
> 
> 
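> After heartbeat was started, every resource in group1 and group2 was stopped and then started again, all at once.
> I would prefer that the resources not be stopped at all. Can this be avoided through configuration?
> The log shows ERROR and WARN entries, so my configuration is probably at fault...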

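I cannot check this in an identical environment, so this answer is based on the log alone.
According to the log, the startup probes (monitor_0) for the resources on 03b.ss.jp were processed as "already running", so a stop was performed once. (rc 7 means not running; rc 0 means running.)
(Normally the resources are not running on 03b.ss.jp, so no stop action is triggered.)
Please check once more whether the resources on 03b.ss.jp are really stopped before Heartbeat is started.
# Adding some logging to the monitor action of the affected resources is another way to track down the problem.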

Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_MailTo1_monitor_0 (18) confirmed on 03b.ss.jp (rc=0)
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: WARN: status_from_rc: Action 17 (res_vps1_monitor_0) on 03b.ss.jp failed (target: 7 vs. rc: 0): Error
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=res_vps1_monitor_0, magic=0:0;17:4594:7:8cc3ab0a-90ab-48a8-aeb8-e52a873290af, cib=0.41.35) : Event failed
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: update_abort_priority: Abort priority upgraded from 0 to 1
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: update_abort_priority: Abort action done superceeded by restart
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_vps1_monitor_0 (17) confirmed on 03b.ss.jp (rc=4)
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_Filesystem1_monitor_0 (15) confirmed on 03b.ss.jp (rc=0)
Jun 28 11:55:10 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_ip1_monitor_0 (16) confirmed on 03b.ss.jp (rc=0)
Jun 28 11:55:11 03a.ss.jp crmd: [3064]: WARN: status_from_rc: Action 21 (res_vps2_monitor_0) on 03b.ss.jp failed (target: 7 vs. rc: 0): Error
Jun 28 11:55:11 03a.ss.jp crmd: [3064]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=res_vps2_monitor_0, magic=0:0;21:4594:7:8cc3ab0a-90ab-48a8-aeb8-e52a873290af, cib=0.41.38) : Event failed
Jun 28 11:55:11 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_vps2_monitor_0 (21) confirmed on 03b.ss.jp (rc=4)
Jun 28 11:55:11 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_MailTo2_monitor_0 (22) confirmed on 03b.ss.jp (rc=0)
Jun 28 11:55:11 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_ip2_monitor_0 (20) confirmed on 03b.ss.jp (rc=0)
Jun 28 11:55:11 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_Filesystem2_monitor_0 (19) confirmed on 03b.ss.jp (rc=0)
Jun 28 11:55:12 03a.ss.jp crmd: [3064]: WARN: status_from_rc: Action 24 (res_drbd2:1_monitor_0) on 03b.ss.jp failed (target: 7 vs. rc: 0): Error
Jun 28 11:55:12 03a.ss.jp crmd: [3064]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=res_drbd2:1_monitor_0, magic=0:0;24:4594:7:8cc3ab0a-90ab-48a8-aeb8-e52a873290af, cib=0.41.42) : Event failed
Jun 28 11:55:12 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_drbd2:1_monitor_0 (24) confirmed on 03b.ss.jp (rc=4)
Jun 28 11:55:12 03a.ss.jp crmd: [3064]: WARN: status_from_rc: Action 23 (res_drbd1:1_monitor_0) on 03b.ss.jp failed (target: 7 vs. rc: 0): Error
Jun 28 11:55:12 03a.ss.jp crmd: [3064]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=res_drbd1:1_monitor_0, magic=0:0;23:4594:7:8cc3ab0a-90ab-48a8-aeb8-e52a873290af, cib=0.41.43) : Event failed
Jun 28 11:55:12 03a.ss.jp crmd: [3064]: info: match_graph_event: Action res_drbd1:1_monitor_0 (23) confirmed on 03b.ss.jp (rc=4)
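
The unexpected probe results can be pulled out of a longer ha-log with a small filter. The following is only a sketch: the function name list_unexpected_probes is made up for illustration, and the sed pattern assumes the crmd message format shown in the excerpt above.

```shell
# Print "<probe action> <node> <expected rc> <actual rc>" for every probe
# (monitor_0) whose result differed from what crmd expected.
# Reads the named log file(s), or standard input when no file is given.
# Assumption: WARN lines look like the status_from_rc lines in this thread.
list_unexpected_probes() {
    grep -h 'status_from_rc' "$@" |
        sed -n 's/.*(\([^ ()]*_monitor_0\)) on \([^ ]*\) failed (target: \([0-9]*\) vs\. rc: \([0-9]*\)).*/\1 \2 \3 \4/p'
}
```

It could be run as `list_unexpected_probes /var/log/ha-log` (the log path is an assumption; use whatever file your ha.cf writes to). Each output line names a probe that was expected to return 7 (not running) but returned something else.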

> Incidentally, after the stop -> start, I force things back to the original layout with the following steps.
> 
> 
> 
> 4. Move the resource using crm on 03b.ss.jp
> # crm
> crm(live)# resource
> crm(live)resource# migrate group2
> WARNING: Creating rsc_location constraint 'cli-standby-group2' with a score of -INFINITY for resource group2 on 03a.ss.jp.
>         This will prevent group2 from running on 03a.ss.jp until the constraint is removed using the 'crm_resource -U' command or manually with cibadmin
>         This will be the case even if 03a.ss.jp is the last node in the cluster
>         This message can be disabled with -Q
> 
> 5. Step 4 added a location constraint on its own, so delete it
> (otherwise 03a.ss.jp will not start the group2 resources when 03b.ss.jp goes down)
> # crm
> crm(live)# configure
> crm(live)configure# edit
> (delete the relevant entry)
> crm(live)configure# commit

The same thing can be done with crm's unmove command, for example.
# I think that is safer than using edit.
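
As a rough sketch of what that could look like from the shell: the wrapper name clear_migrate_constraint and the DRY_RUN variable are hypothetical, added only so the command can be previewed before it touches the cluster. The crm subcommand itself is `resource unmove` (a synonym of `unmigrate`).

```shell
# Remove the location constraint left behind by "crm resource migrate".
# Set DRY_RUN=1 to print the command instead of running it.
clear_migrate_constraint() {
    rsc=$1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "crm resource unmove $rsc"
    else
        crm resource unmove "$rsc"
    fi
}
```

For the case in this thread it would be called as `clear_migrate_constraint group2` once group2 is running on 03b.ss.jp again.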

Also, for the drbd RA, it seems to be better to use the linbit RA.

That is all. I am afraid this may not be much help in solving the problem...


> 
> 
> Given that I have to do steps 4 and 5 at all, it is very likely that my configuration is wrong.
> I am attaching the heartbeat log (ha-log) and the configuration file (configure) saved from crm.

Sorry, I have not looked at the configuration.
First, please check whether something is causing the monitor_0 errors.
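
As one example of such a check: the probes for res_drbd1:1 and res_drbd2:1 returned 0 in the log, and in step 1 of the procedure DRBD was started by hand before heartbeat. A minimal sketch for checking whether DRBD is already up on a node might look like this. The function name drbd_already_up is hypothetical, and it relies only on the /proc/drbd format quoted in the question. It says nothing about the vps resources, which would need their own check.

```shell
# Succeeds (exit status 0) when the given /proc/drbd-style file lists at
# least one configured device, i.e. DRBD is already up on this node.
# Assumption: unconfigured devices are shown as "cs:Unconfigured".
drbd_already_up() {
    status_file=${1:-/proc/drbd}
    [ -r "$status_file" ] || return 1
    grep -E '^ *[0-9]+: cs:' "$status_file" | grep -vq 'cs:Unconfigured'
}
```

Running `drbd_already_up && echo "DRBD is already running on this node"` on 03b.ss.jp just before starting heartbeat would show whether the drbd probes are going to find the resource already active.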

> 
> 
> I know many of you are busy preparing for the third study meeting, but I would appreciate your help.
> Also, for various reasons I cannot attend this study meeting. (´・ω・｀)
> 
> 
> If anything is unclear, please let me know.
> Thank you in advance.
> 
> 
> 
> 
> 根本 稔也
> ----
> nemo****@zuku*****
> ----
>
