What on Earth is a Split-Brain Scenario in a MySQL Database Cluster?

Posted June 24th, 2019

Overview

The Skinny

In this blog post we will define what a split-brain scenario means in a MySQL database cluster, and then explore how a Tungsten MySQL database cluster reacts to a split-brain situation.


Agenda

What’s Here?
  • Define the term “split-brain”
  • Briefly explore how the Tungsten Manager works to monitor the cluster health and prevent data corruption in the event of a network partition
  • Also explore how the Tungsten Connector works to route writes
  • Describe how a Tungsten MySQL database cluster reacts to a split-brain situation
  • Illustrate various testing and recovery procedures

Split-Brain: Definition and Impact

Sounds scary, and it is!

A split-brain occurs when a MySQL database cluster, which normally has a single write master, ends up with two writable masters.

This means that some writes which should go to the “real” master are sent to a different node which was promoted to write master by mistake.

Once that happens, some writes exist on one master and not the other, leaving two broken masters. Merging the two data sets is effectively impossible, forcing a full restore, which is clearly NOT desirable.

Needless to say, a split-brain scenario is to be strongly avoided.

A situation like this is most often encountered when there is a network partition of some sort, especially with the nodes spread over multiple availability zones in a single region of a cloud deployment.

This would potentially result in all nodes being isolated, without a clear majority within the voting quorum.

A poorly-designed cluster could elect more than one master under these conditions, leading to the split-brain scenario.


Tungsten Manager: A Primer

A Very Brief Summary

The Tungsten Manager health-checks the cluster nodes and the MySQL databases.

The Manager is responsible for initiating various failure states and helping to automate recovery efforts.

Each Manager communicates with the others via a Java JGroups-based group communication channel.

The Connectors also obtain cluster status information from a chosen Manager.


Tungsten Connector: A Primer

Another Brief Summary

The Tungsten Connector is an intelligent MySQL database proxy located between the clients and the database servers, providing a single connection point, while routing queries to the database servers.

Simply put, the Connector is responsible for routing MySQL queries to the correct node in the cluster.

In the event of a failure, the Tungsten Connector can automatically route queries away from the failed server and towards servers that are still operating.

When the cluster Managers detect a failed master (e.g. because the MySQL server port is no longer reachable), the Connectors are signaled and client traffic is re-routed to the newly-elected Master node.

Each Connector makes a TCP connection to any available Manager, then all command-and-control traffic uses that channel. The Manager never initiates a connection to the Connector.

When there is a state change (e.g. shun, welcome, failover, etc.), the Manager communicates it to the Connector over the existing channel.

The Connector will re-establish a channel to an available Manager if the Manager it is connected to is stopped or lost.

For more detailed information about how the Tungsten Connector works, please read our blog post, “Experience the Power of the Tungsten Connector”.


Failsafe-Shun: Safety by Design

Protect the data first and foremost!

Since a network partition would potentially result in all nodes being isolated without a clear majority within the voting quorum, the default action of a Tungsten Cluster is to SHUN all of the nodes.

Shunning ALL of the nodes means that no client traffic is processed by any node; both reads and writes are blocked.

When this happens, it is up to a human administrator to select the proper master and recover the cluster.

The main thing that avoids split-brain in our clustering is that the Connector is either:

  1. connected to a Manager that is a member of a quorum, or
  2. connected to a Manager that has all resources shunned

In the first case, it is guaranteed to have a single master. In the second case, it can’t connect to anything until the Manager it is connected to is part of a quorum.

Example Failure Testing Procedures

Use this failure test procedure ONLY in a dev/test environment.
Use this procedure AT YOUR OWN RISK!

A failsafe-shun scenario can be forced.

Given a 3-node cluster east, with master db1 and slaves db2/db3, simply stop the manager process on both slaves and wait about 60 seconds:

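A minimal sketch of this step, assuming a standard Tungsten installation where the manager wrapper script (normally found under the cluster-home/bin directory) is on the PATH:

    # On slave db2, stop the local Tungsten Manager process:
    shell> manager stop

    # On slave db3, stop the local Tungsten Manager process:
    shell> manager stop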

The Manager on the master node db1 will restart itself after an appropriate timeout and the entire cluster will then be in FAILSAFE-SHUN status.

Once you have verified that the cluster is in FAILSAFE-SHUN status, start the Managers on both slaves before proceeding with recovery:

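Again, a sketch of the step, under the same assumption that the manager wrapper script is on the PATH:

    # On slave db2, start the local Tungsten Manager process:
    shell> manager start

    # On slave db3, start the local Tungsten Manager process:
    shell> manager start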

Example Recovery Procedures

First, examine the state of the dataservice and choose which datasource is the most up to date or canonical. In the following example, each datasource has the same sequence number, so any of them could potentially be used as the master:

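A sketch of how to inspect the dataservice, assuming the cluster tools (cctrl) are on the PATH; compare the sequence number reported for each datasource in the ls output and confirm the FAILSAFE-SHUN status:

    shell> cctrl
    cctrl> ls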

Recover Master Using

Once you have selected the correct host, use cctrl to issue the recover master using command, specifying the full service name and hostname of the chosen datasource:

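A sketch of the command, using the service name east and host db1 from the example above; substitute your own service and host names:

    shell> cctrl
    cctrl> recover master using east/db1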

You will be prompted to confirm that you wish to use the selected host as the new master. cctrl then proceeds to set the new master and recover the remaining slaves.

If this operation fails, you can try the more manual process, next.

Welcome and Recover

A simple welcome attempt will fail:

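A sketch, using host db1 from the example above; the welcome is rejected while the datasource is still failsafe-shunned and force mode is off:

    cctrl> datasource db1 welcome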

To use the welcome command, the force mode must be enabled first:

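A sketch of enabling force mode and re-issuing the welcome for the chosen master:

    cctrl> set force true
    cctrl> datasource db1 welcome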

Note the OFFLINE state in the result. In AUTOMATIC mode, the datasource will be set to ONLINE before you have time to run the ls command and look at the cluster state:

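A sketch of checking the cluster state; if the policy is MAINTENANCE rather than AUTOMATIC, the datasource may also need to be brought online explicitly (an assumption based on standard cctrl usage):

    cctrl> ls
    cctrl> datasource db1 online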

Finally, recover the remaining slaves:

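A sketch of the final step; the recover command attempts to bring the remaining failed or shunned slaves back into the cluster:

    cctrl> recover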

Summary

The Wrap-Up

In this blog post we defined what a split-brain scenario means, and explored how a Tungsten MySQL database cluster reacts to a split-brain situation.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions


The Library

Please read the docs!

For more information about Tungsten Cluster recovery procedures, please visit https://docs.continuent.com/tungsten-clustering-6.0/operations-recovery-master.html

Tungsten Clustering is the most flexible, performant global database layer available today – use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions

Want to learn more or run a POC? Contact us.
