The time has come. People keep asking why there is a practical limit on the number of nodes in a multi-master cluster and what exactly it is. So here's some no-nonsense hardcore multi-master math
(hereafter I assume that all nodes in a cluster have the same processing power and that load is distributed uniformly between them).
Let's denote the total work that a node can do as W, the work it does serving local connections as L, and the work it does applying replication events from other nodes as R. While W is constant, L and R are variable and depend on the number of nodes. Clearly R is unproductive replication overhead which we'd rather not have, but there is no choice.
Then for a single server we have W = L.
For a 2-node cluster we have W = L + R.
For an N-node cluster we have W = L + R×(N − 1).
Let's denote the ratio R/L as α. This quantity lies between 0 and 1 and depends on the load profile and applier efficiency. For read-intensive loads it is closer to 0, for write-intensive loads it is closer to 1. Then for an N-node cluster we have:
W = L + L×α×(N − 1) = L×(1 + α×(N − 1))
Now if we define node efficiency ε as L/W (the ratio of the work done by the node to serve clients to the work it could have done as a single server), we get
ε = 1 / (1 + α×(N − 1))
And the cluster scale factor, i.e. the ratio of useful (serving) work done by the cluster to the work done by a single server, becomes
S = N×ε = N / (1 + α×(N − 1)) < 1/α
So there is a strictly defined scalability limit which is approached asymptotically as N approaches infinity.
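To put some numbers on this, here's a small Python sketch (not from the original post; α = 0.2 is just an illustrative value) that evaluates ε and S from the formulas above and shows S creeping up on the 1/α ceiling:

```python
# Node efficiency and cluster scale factor from the formulas above.
# alpha = R/L is assumed to be 0.2 here purely for illustration.

def efficiency(alpha, n):
    """epsilon = 1 / (1 + alpha*(n - 1))"""
    return 1.0 / (1.0 + alpha * (n - 1))

def scale_factor(alpha, n):
    """S = n * epsilon"""
    return n * efficiency(alpha, n)

alpha = 0.2
for n in (1, 2, 4, 8, 16, 64, 1024):
    print(n, round(efficiency(alpha, n), 3), round(scale_factor(alpha, n), 3))
# S never reaches 1/alpha = 5, no matter how many nodes we add
```

At 16 nodes this hypothetical cluster already does the work of 4 servers while each node runs at 25% efficiency; going to 1024 nodes buys less than one extra server's worth of work.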
What sucks about math is that these scribbles give little feeling for how bad things really are. So here's some fine art:
Note that the above calculations are valid for any type of multi-master replication and are absolute upper limits on node efficiency and cluster scale factor. That is, for a given α you just can't do better, period. Of course in practice there is always some additional overhead, so things are even less favourable for scalability.
There are 2 important conclusions from this:
- α is paramount. Since much of it is determined by the amount of work done on the "slave" side, making replication events as fast as possible to apply is the single most important key to multi-master scalability.
- We actually can answer the question: the optimum number of nodes in a multi-master cluster is roughly 1/α + 1.
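One way to read that rule of thumb (my own gloss, not from the post): plug N = 1/α + 1 into the efficiency formula and each node comes out running at exactly 50% efficiency — beyond that point most of every extra node's capacity goes to replication overhead. A quick check:

```python
# At the suggested optimum N = 1/alpha + 1, node efficiency is exactly 1/2:
# epsilon = 1 / (1 + alpha * ((1/alpha + 1) - 1)) = 1 / (1 + 1) = 0.5
# The alpha values below are illustrative, not measurements.

def efficiency(alpha, n):
    return 1.0 / (1.0 + alpha * (n - 1))

for alpha in (0.05, 0.1, 0.25, 0.5):
    n_opt = 1.0 / alpha + 1.0
    print(f"alpha={alpha}: ~{n_opt:.0f} nodes, per-node efficiency {efficiency(alpha, n_opt):.2f}")
```

So a write-heavy load with α = 0.5 tops out around 3 useful masters, while a read-heavy load with α = 0.05 can justify around 21.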
When it comes to Galera, however, the certification it uses is very sensitive to the number of nodes. Due to the "birthday paradox" the rate of certification conflicts rises sharply with the number of nodes, especially if the number of concurrent connections is increased. Thus in a Galera cluster the best results may be achieved by limiting the number of master nodes even further.
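To see why the birthday paradox bites, here's a toy model (entirely my own assumption, not Galera's actual certification logic): k concurrent transactions each update one row chosen uniformly from m hot rows, and any two picking the same row is a potential certification conflict. The collision probability grows much faster than k itself — which is what happens when more master nodes each carry their own concurrent connections:

```python
# Toy birthday-paradox model: probability that at least two of k
# concurrent transactions touch the same row out of m hot rows.
# This illustrates the effect only; it is not Galera's conflict rate.

def conflict_probability(k, m):
    p_no_collision = 1.0
    for i in range(k):
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision

m = 10_000  # hypothetical hot-row set size
for k in (8, 32, 128, 512):
    print(k, round(conflict_probability(k, m), 3))
```

Just as 23 people give a better-than-even chance of a shared birthday, a few hundred concurrent transactions over ten thousand hot rows make a conflict near certain.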
(BTW, Baron Schwartz of Percona gave a talk at the recent MySQL User Conference where he likewise argued that there is a practical limit on the number of slaves in a master-slave cluster.)