
Node unable to join a Galera cluster; it fails during state transfer

It looks like the node joins the cluster and then fails... I have tried with both rsync and xtrabackup, and it fails during the state transfer. I feel like I am missing something simple and I just can't put my finger on it. Any help would be appreciated.

More information about the nodes

Master - 10.xxx.xxx.161
Node1 - 10.xxx.xxx.160

Installed packages: MariaDB-compat, MariaDB-common, MariaDB-devel, MariaDB-client, MARIADB-SERVICE, MariaDB-Galera-server (v5.5.29-1), galera (v23.2.4-1.rhel6), percona-xtrabackup (v2.1.6-702.rhel6)

Config for node 1

wsrep_cluster_address = gcomm://10.XXX.XXX.161
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
wsrep_provider_options = gcache.size=4G; gcache.page_size=1G
wsrep_cluster_name = galera_cluster
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
wsrep_sst_method = xtrabackup
wsrep_sst_auth = root:rootpassword

Config for master

wsrep_cluster_address = gcomm://
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
wsrep_provider_options = gcache.size=4G; gcache.page_size=1G
wsrep_cluster_name = galera_cluster
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
wsrep_sst_method = rsync
wsrep_slave_threads = 4
wsrep_sst_auth = root:rootpassword
wsrep_node_name = 2

Node1 log file

131203 16:09:03 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
131203 16:09:03 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.f2EedjRjbQ
131203 16:09:08 mysqld_safe WSREP: Recovered position 359350ee-5c63-11e3-0800-6673d15135cd:2188
131203 16:09:08 [Note] WSREP: wsrep_start_position var submitted: '359350ee-5c63-11e3-0800-6673d15135cd:2188'
131203 16:09:08 [Note] WSREP: Read nil XID from storage engines, skipping position init
131203 16:09:08 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
131203 16:09:08 [Note] WSREP: wsrep_load(): Galera 23.2.4(r147) by Codership Oy <[email protected]]]> loaded succesfully.
131203 16:09:08 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
131203 16:09:08 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
131203 16:09:08 [Note] WSREP: Passing config to GCS: base_host = 10.XXX.XXX.160; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 1G; gcache.size = 4G; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
131203 16:09:08 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
131203 16:09:08 [Note] WSREP: wsrep_sst_grab()
131203 16:09:08 [Note] WSREP: Start replication
131203 16:09:08 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
131203 16:09:08 [Note] WSREP: protonet asio version 0
131203 16:09:08 [Note] WSREP: backend: asio
131203 16:09:08 [Note] WSREP: GMCast version 0
131203 16:09:08 [Note] WSREP: (8814b4ba-5c67-11e3-0800-91035d554a96, 'tcp://') listening at tcp://
131203 16:09:08 [Note] WSREP: (8814b4ba-5c67-11e3-0800-91035d554a96, 'tcp://') multicast: , ttl: 1
131203 16:09:08 [Note] WSREP: EVS version 0
131203 16:09:08 [Note] WSREP: PC version 0
131203 16:09:08 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '10.XXX.XXX.161:'
131203 16:09:09 [Note] WSREP: declaring 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a stable
131203 16:09:09 [Note] WSREP: Node 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a state prim
131203 16:09:09 [Note] WSREP: view(view_id(PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,2) memb {
} joined {
} left {
} partitioned {
131203 16:09:09 [Note] WSREP: gcomm: connected
131203 16:09:09 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
131203 16:09:09 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
131203 16:09:09 [Note] WSREP: Opened channel 'galera_cluster'
131203 16:09:09 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
131203 16:09:09 [Note] WSREP: Waiting for SST to complete.
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: sent state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 0 (2)
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 1 (1)
131203 16:09:09 [Note] WSREP: Quorum results:
     version    = 2,
     component  = PRIMARY,
     conf_id    = 1,
     members    = 1/2 (joined/total),
     act_id     = 2521,
     last_appl. = -1,
     protocols  = 0/4/2 (gcs/repl/appl),
     group UUID = 359350ee-5c63-11e3-0800-6673d15135cd
131203 16:09:09 [Note] WSREP: Flow-control interval: [23, 23]
131203 16:09:09 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2521)
131203 16:09:09 [Note] WSREP: State transfer required:
     Group state: 359350ee-5c63-11e3-0800-6673d15135cd:2521
     Local state: 00000000-0000-0000-0000-000000000000:-1
131203 16:09:09 [Note] WSREP: New cluster view: global state: 359350ee-5c63-11e3-0800-6673d15135cd:2521, view# 2: Primary, number of nodes: 2, my index: 1, protocol version 2
131203 16:09:09 [Warning] WSREP: Gap in state sequence. Need state transfer.
131203 16:09:11 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.XXX.XXX.160' --auth 'root:rootpassword' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '13175''
131203 16:09:11 [Note] WSREP: Prepared SST request: xtrabackup|
131203 16:09:11 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:11 [Note] WSREP: Assign initial position for certification: 2521, protocol version: 2
131203 16:09:11 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (359350ee-5c63-11e3-0800-6673d15135cd): 1 (Operation not permitted)
      at galera/src/replicator_str.cpp:prepare_for_IST():442. IST will be unavailable.
131203 16:09:11 [Note] WSREP: Node 1 (1) requested state transfer from '*any*'. Selected 0 (2)(SYNCED) as donor.
131203 16:09:11 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2525)
131203 16:09:11 [Note] WSREP: Requesting state transfer: success, donor: 0
tar: dbexport/db.opt: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors
131203 16:10:22 [Note] WSREP: 0 (2): State transfer to 1 (1) complete.
131203 16:10:22 [Note] WSREP: Member 0 (2) synced with group.
WSREP_SST: [ERROR] Error while getting st data from donor node:  0, 2 (20131203 16:10:22.379)
131203 16:10:22 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address '10.XXX.XXX.160' --auth 'root:rootpassword' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '13175': 32 (Broken pipe)
131203 16:10:22 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
131203 16:10:22 [ERROR] WSREP: SST failed: 32 (Broken pipe)
131203 16:10:22 [ERROR] Aborting
131203 16:10:24 [Note] WSREP: Closing send monitor...
131203 16:10:24 [Note] WSREP: Closed send monitor.
131203 16:10:24 [Note] WSREP: gcomm: terminating thread
131203 16:10:24 [Note] WSREP: gcomm: joining thread
131203 16:10:24 [Note] WSREP: gcomm: closing backend
131203 16:10:25 [Note] WSREP: view(view_id(NON_PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,2) memb {
} joined {
} left {
} partitioned {
131203 16:10:25 [Note] WSREP: view((empty))
131203 16:10:25 [Note] WSREP: gcomm: closed
131203 16:10:25 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
131203 16:10:25 [Note] WSREP: Flow-control interval: [16, 16]
131203 16:10:25 [Note] WSREP: Received NON-PRIMARY.
131203 16:10:25 [Note] WSREP: Shifting JOINER -> OPEN (TO: 2607)
131203 16:10:25 [Note] WSREP: Received self-leave message.
131203 16:10:25 [Note] WSREP: Flow-control interval: [0, 0]
131203 16:10:25 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131203 16:10:25 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 2607)
131203 16:10:25 [Note] WSREP: RECV thread exiting 0: Success
131203 16:10:25 [Note] WSREP: recv_thread() joined.
131203 16:10:25 [Note] WSREP: Closing slave action queue.
131203 16:10:25 [Note] WSREP: Service disconnected.
131203 16:10:25 [Note] WSREP: rollbacker thread exiting
131203 16:10:26 [Note] WSREP: Some threads may fail to exit.
131203 16:10:26 [Note] /usr/sbin/mysqld: Shutdown complete
Error in my_thread_global_end(): 2 threads didn't exit
131203 16:10:31 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
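
Because the joiner log interleaves output from the SST script (and the tools it calls, such as tar) with mysqld's own messages, it can help to pull out only the lines that signal a failure. The following is a minimal sketch, not part of any Galera tooling: the function name `suspicious_lines` and the prefixes it matches are assumptions chosen to fit the log above.

```python
def suspicious_lines(log_lines):
    """Return log lines that look like failures.

    Assumption: failures show up either as mysqld "[ERROR]" entries or as
    untimestamped output from the SST helper ("tar:" / "WSREP_SST:").
    """
    hits = []
    for line in log_lines:
        text = line.rstrip("\n")
        if "[ERROR]" in text or text.startswith(("tar:", "WSREP_SST:")):
            hits.append(text)
    return hits


if __name__ == "__main__":
    # A few lines copied from the node1 log above.
    sample = [
        "131203 16:09:11 [Note] WSREP: Requesting state transfer: success, donor: 0",
        "tar: dbexport/db.opt: Cannot open: Permission denied",
        "131203 16:10:22 [Note] WSREP: Member 0 (2) synced with group.",
        "131203 16:10:22 [ERROR] WSREP: SST failed: 32 (Broken pipe)",
    ]
    for hit in suspicious_lines(sample):
        print(hit)
```

On the sample lines this prints the tar "Permission denied" line and the final SST error, which are the two entries worth reading first.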

Master log file

131203 16:08:47 [Warning] Recovery from master pos 103358630 and file mysql-bin.000131.
131203 16:08:47 [Note] Event Scheduler: Loaded 0 events
131203 16:08:47 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:08:47 [Note] WSREP: Assign initial position for certification: 2497, protocol version: 2
131203 16:08:47 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.29-MariaDB-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server, wsrep_23.7.3.rXXXX
131203 16:08:47 [Note] WSREP: Synchronized with group, ready for connections
131203 16:08:47 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:09 [Note] WSREP: declaring 8814b4ba-5c67-11e3-0800-91035d554a96 stable
131203 16:09:09 [Note] WSREP: Node 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a state prim
131203 16:09:09 [Note] WSREP: view(view_id(PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,2) memb {
} joined {
} left {
} partitioned {
131203 16:09:09 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
131203 16:09:09 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 8861cdd5-5c67-11e3-0800-cc70fcc5f515
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: sent state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 0 (2)
131203 16:09:09 [Note] WSREP: STATE EXCHANGE: got state msg: 8861cdd5-5c67-11e3-0800-cc70fcc5f515 from 1 (1)
131203 16:09:09 [Note] WSREP: Quorum results:
     version    = 2,
     component  = PRIMARY,
     conf_id    = 1,
     members    = 1/2 (joined/total),
     act_id     = 2521,
     last_appl. = 2517,
     protocols  = 0/4/2 (gcs/repl/appl),
     group UUID = 359350ee-5c63-11e3-0800-6673d15135cd
131203 16:09:09 [Note] WSREP: Flow-control interval: [23, 23]
131203 16:09:09 [Note] WSREP: New cluster view: global state: 359350ee-5c63-11e3-0800-6673d15135cd:2521, view# 2: Primary, number of nodes: 2, my index: 0, protocol version 2
131203 16:09:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:09 [Note] WSREP: Assign initial position for certification: 2521, protocol version: 2
131203 16:09:11 [Note] WSREP: Node 1 (1) requested state transfer from '*any*'. Selected 0 (2)(SYNCED) as donor.
131203 16:09:11 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 2525)
131203 16:09:11 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:09:11 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'donor' --address '10.XXX.XXX.160:4444/xtrabackup_sst' --auth 'root:rootpassword' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '359350ee-5c63-11e3-0800-6673d15135cd:2525''
131203 16:09:11 [Note] WSREP: sst_donor_thread signaled with 0
131203 16:10:20 [Note] WSREP: Provider paused at 359350ee-5c63-11e3-0800-6673d15135cd:2604
131203 16:10:22 [Note] WSREP: Provider resumed.
131203 16:10:22 [Note] WSREP: 0 (2): State transfer to 1 (1) complete.
131203 16:10:22 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 2606)
131203 16:10:22 [Note] WSREP: Member 0 (2) synced with group.
131203 16:10:22 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2606)
131203 16:10:22 [Note] WSREP: Synchronized with group, ready for connections
131203 16:10:22 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:10:25 [Note] WSREP: Node 7a9a87e8-5c67-11e3-0800-8cb6cba8f65a state prim
131203 16:10:25 [Note] WSREP: view(view_id(PRIM,7a9a87e8-5c67-11e3-0800-8cb6cba8f65a,3) memb {
} joined {
} left {
} partitioned {
131203 16:10:25 [Note] WSREP: forgetting 8814b4ba-5c67-11e3-0800-91035d554a96 (tcp://10.XXX.XXX.160:4567)
131203 16:10:25 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
131203 16:10:25 [Note] WSREP: STATE_EXCHANGE: sent state UUID: b5dda52e-5c67-11e3-0800-4b2360dd84f9
131203 16:10:25 [Note] WSREP: STATE EXCHANGE: sent state msg: b5dda52e-5c67-11e3-0800-4b2360dd84f9
131203 16:10:25 [Note] WSREP: STATE EXCHANGE: got state msg: b5dda52e-5c67-11e3-0800-4b2360dd84f9 from 0 (2)
131203 16:10:25 [Note] WSREP: Quorum results:
     version    = 2,
     component  = PRIMARY,
     conf_id    = 2,
     members    = 1/1 (joined/total),
     act_id     = 2607,
     last_appl. = 2597,
     protocols  = 0/4/2 (gcs/repl/appl),
     group UUID = 359350ee-5c63-11e3-0800-6673d15135cd
131203 16:10:25 [Note] WSREP: Flow-control interval: [16, 16]
131203 16:10:25 [Note] WSREP: New cluster view: global state: 359350ee-5c63-11e3-0800-6673d15135cd:2607, view# 3: Primary, number of nodes: 1, my index: 0, protocol version 2
131203 16:10:25 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131203 16:10:25 [Note] WSREP: Assign initial position for certification: 2607, protocol version: 2
131203 16:10:30 [Note] WSREP:  cleaning up 8814b4ba-5c67-11e3-0800-91035d554a96 (tcp://10..XXX.XXX.160:4567)

We had a similar problem here; nodes 2 and 3 had become partitioned from node 1, and when we tried to bring them back up, the SST failed (there had been a huge number of DB updates, the 300M gcache had already rolled over, and an IST was not possible).
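
The choice between IST and SST described above can be sketched roughly as follows. This is a simplified model, not Galera's actual code: the function name and the `donor_gcache_first_seqno` parameter (the oldest write-set still held in the donor's gcache) are hypothetical names introduced for illustration.

```python
NIL_UUID = "00000000-0000-0000-0000-000000000000"


def transfer_method(local_uuid, local_seqno, group_uuid, group_seqno,
                    donor_gcache_first_seqno):
    """Guess which state transfer a joining node would need.

    Simplified assumption: IST is only possible when the joiner's state
    belongs to the same group (matching UUID) and the donor's gcache still
    holds every write-set after the joiner's last applied seqno.
    """
    if local_uuid != group_uuid or local_seqno < 0:
        return "SST"   # unknown or foreign state: full copy needed
    if local_seqno == group_seqno:
        return "none"  # already up to date
    if donor_gcache_first_seqno <= local_seqno + 1:
        return "IST"   # missing write-sets are still cached on the donor
    return "SST"       # gcache has rolled over past what the joiner needs


if __name__ == "__main__":
    group = "359350ee-5c63-11e3-0800-6673d15135cd"
    # Values from the question's log: local state is the nil UUID at seqno -1.
    print(transfer_method(NIL_UUID, -1, group, 2521, 1))
    # A joiner that fell too far behind a small gcache (made-up seqnos).
    print(transfer_method(group, 100, group, 5000, 4000))
```

Both calls return "SST": the first because the local state UUID does not match the group, the second because the cache no longer reaches back far enough.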

The solution was to restart node 1, which cleared the WSREP node state; then start node 2, which did a (very long) SST and came up OK. Then bring up node 3 in the same way, and all 3 were up and members.

However, this solution means a brief (<1 min) service outage while the primary node is restarted, which may not be acceptable.

I have not yet found a way to clear the "partitioned" status of the down node on the surviving node without a restart.

Like you, we use Percona XtraBackup for the WSREP SST. I believe the problem lies in it requiring a database connection; if you use rsync, the SST may complete as expected.

Steve Shipway

Galera Cluster is a quorum-based system, so it is subject to a split-brain condition when the quorum algorithm fails to select a primary component. This can happen, for example, in a cluster without a backup switch if the main switch fails. But the most likely split-brain situation is when a single node fails in a two-node cluster. It is therefore strongly advised that the minimum Galera Cluster configuration be 3 nodes. In the State Transfer section below, we will look at another reason why 3 is the recommended minimum number of nodes.
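
The two-node case can be illustrated with a little arithmetic. This is a simplified sketch of majority-based quorum, assuming every node carries equal weight and that the failed node disappeared without leaving gracefully; `has_quorum` is an illustrative name, not a Galera function.

```python
def has_quorum(reachable_nodes, previous_component_size):
    """True if the reachable nodes form a strict majority of the last
    primary component (equal node weights assumed)."""
    return 2 * reachable_nodes > previous_component_size


if __name__ == "__main__":
    # Two-node cluster, one node crashes: the survivor sees 1 of 2.
    print(has_quorum(1, 2))  # False
    # Three-node cluster, one node crashes: the survivors see 2 of 3.
    print(has_quorum(2, 3))  # True
```

With two nodes, the survivor holds exactly half of the previous component, which is not a majority, so it cannot tell a crashed peer from a network split. With three nodes, the remaining two still form a majority and stay primary.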

So, make sure that you have

wsrep_cluster_address = gcomm://,, # where node1 = and node2 = etc.

on all nodes, and restart the nodes one at a time after this change.
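
As a small illustration of the format, the address is just the gcomm:// scheme followed by a comma-separated list of every node. A minimal sketch, assuming placeholder host names (`node1`, `node2`, `node3` stand in for the real addresses, which are not given here):

```python
def build_cluster_address(hosts):
    """Build a wsrep_cluster_address value listing every cluster member."""
    if not hosts:
        raise ValueError("list at least one node address")
    return "gcomm://" + ",".join(hosts)


if __name__ == "__main__":
    # Placeholder names; substitute the real IPs or host names of the nodes.
    hosts = ["node1", "node2", "node3"]
    print("wsrep_cluster_address = " + build_cluster_address(hosts))
```

The same line can then go into my.cnf on every node, so that any node can find the others when it restarts.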

A typical my.cnf for Galera Cluster should look like this: http://www.codership.com/wiki/doku.php?id=mysql_galera_configuration

I hope this information helps

Mahesh Patil