
Scaling Drupal stack with Galera: part 2, The Mystery of a Failed Login


This is the second part in the series of posts about scaling the Drupal stack. The first part can be found here.

GLB has been fixed to support unlimited connections, and now I can benchmark the Drupal stack cluster on large EC2 instances. What I'm mainly looking for is how much Galera synchronization overhead affects performance. With small instances everything was pretty much clear: a single core hindered by Xen was an easy target, and scalability was predictably linear. A large instance is dual-core and Xen interference is minimal, so the effect of Galera synchronization and serialization must be more pronounced. How desperately bad is it?
We'll start by looking at HTTP load "elasticity" on a large EC2 instance:

Users	Throughput	Latency		Errors
	(req/min)	(ms, median)	(%)
-----------------------------------------------
20	92		171		0.00
40	182		182		0.00
60	270		201		0.00
80	358		214		0.00
100	448		240		0.00
120	534		289		0.00
140	623		371		0.00
160	677		562		0.00
180	716		1250		0.00
200	706		2966		0.00
220	694		4730		0.00
240	682		6270		0.00
260	673		7256		0.00
280	688		8896		0.00
300	701		9409		0.00
320	680		11100		0.00

It looks even less elastic than on a small instance. Just a 12% increase in concurrent users turned the server from moderately loaded to saturated and bumped latencies up 2.5 times. (For the small instance it was 33%.) It is surprising how closely it follows theoretical predictions.
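
As a quick illustration of what such a prediction looks like, assuming a simple closed-system model: throughput should grow roughly linearly with the number of users until the bottleneck resource saturates, then flatten out, while past that point latency grows roughly linearly with the number of users. Here is a back-of-the-envelope sketch with numbers eyeballed from the table above (an illustration, not a proper fit):

    # Back-of-the-envelope check of the saturation point, assuming a simple
    # closed-system model. All numbers are eyeballed from the table above.

    # At 20 users the server does ~92 req/min, so each user issues ~4.6 req/min,
    # i.e. think time + response time per request is about 13 seconds.
    z_plus_r = 60.0 / (92.0 / 20)        # seconds per request per user, ~13 s

    # Peak throughput is ~700 req/min, so the bottleneck service demand is
    # roughly 60/700 seconds per request.
    d_max = 60.0 / 700                   # ~0.086 s

    # The knee of the throughput curve is expected around N* = (Z + R) / D_max.
    print("predicted saturation at ~%.0f users" % (z_plus_r / d_max))  # ~150

    # Throughput bound: X(N) <= min(N / (Z + R), 1 / D_max); past the knee
    # latency is bounded below by roughly N * D_max - Z, i.e. it grows linearly.
    for n in (20, 100, 160, 200, 300):
        x_bound = min(n / z_plus_r, 1.0 / d_max) * 60    # req/min
        print(n, "users -> at most", round(x_bound), "req/min")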

While we are at it, it would be curious to see the same effect on a 4-node Drupal cluster. For a saturation point I took 4x180 = 720 concurrent users and looked at how the cluster behaved when departing from that value:

Users	Throughput	Latency		Latency		Errors
	(req/min)	(ms, median)	(ms, average)	(%)
--------------------------------------------------------------
20	90		151		149		0.00
120	530		180		168		0.00
220	967		179		174		0.00
320	1407		202		202		0.02
420	1845		223		233		0.03
520	2308		276		299		0.03
620	2690		411		545		0.08
720	2717		1214		2330		0.12
820	2550		1946		6502		0.04
920	2480		2007		n/a		0.07
1020	2360		1878		n/a		0.12
1120	2300		1955		11200		0.13

The scatter in the error curve is due to the very small number of errors (one in several thousand requests), so it is purely statistical. Two interesting things here:

  1. It is even worse than with a single server. No comment on that.
  2. Median latencies start to differ significantly from the average ones and no longer grow linearly with the number of users. This means that the performance degradation comes from a few requests with extremely high latencies, while most other latencies stay around 2 seconds (the sketch right after this list illustrates the effect). To the end user this probably looks like the site gets really stuck from time to time and then goes back to normal.
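
To see how a few very slow requests pull the average away from the median, here is a tiny sketch with made-up numbers (not the benchmark data): 99% of requests take about 2 seconds, 1% get stuck for 30-120 seconds.

    import random
    import statistics

    random.seed(1)

    # Made-up latency sample: 99% of requests around ~2 s, 1% "stuck" requests
    # taking 30-120 s. Only meant to show how the tail skews the average.
    latencies = [random.gauss(2.0, 0.4) for _ in range(9900)] + \
                [random.uniform(30.0, 120.0) for _ in range(100)]

    print("median:  %.2f s" % statistics.median(latencies))   # stays near 2 s
    print("average: %.2f s" % statistics.mean(latencies))     # pulled up by the tail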

Be that as it may, the cluster benchmarking again has to be split in two (a sketch of the kind of load run involved follows the list):

  • Throughput measurement at the saturation point, which scales proportionally with the number of nodes.
  • Latency measurement at the single-node saturation point of 180 concurrent users.
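
For reference, both measurements come from the same kind of closed-loop run: N concurrent "user" threads repeatedly hit the site, and req/min, median/average latency and the error rate are computed from the recorded samples. The sketch below only captures that general shape; the URL, think time and duration are placeholders, not the actual benchmark configuration:

    import statistics
    import threading
    import time
    import urllib.request

    URL = "http://drupal-cluster.example/"   # placeholder, not the real test URL
    USERS = 180                              # concurrent "browser" users
    DURATION = 300                           # seconds to run
    THINK_TIME = 10                          # seconds a user "reads" between requests

    lock = threading.Lock()
    latencies, errors = [], []

    def browser():
        deadline = time.time() + DURATION
        while time.time() < deadline:
            start = time.time()
            try:
                with urllib.request.urlopen(URL, timeout=60) as resp:
                    resp.read()
                    ok = (resp.status == 200)
            except Exception:
                ok = False
            elapsed = time.time() - start
            with lock:
                latencies.append(elapsed)
                errors.append(0 if ok else 1)
            time.sleep(THINK_TIME)

    threads = [threading.Thread(target=browser) for _ in range(USERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("throughput: %.0f req/min" % (len(latencies) / DURATION * 60.0))
    print("latency: median %.0f ms, average %.0f ms" %
          (statistics.median(latencies) * 1000, statistics.mean(latencies) * 1000))
    print("errors: %.2f %%" % (100.0 * sum(errors) / len(errors)))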

Throughput

Nodes	Users	Throughput	Latency		Latency		Errors
		(req/min)	(ms, median)	(ms, average)	(%)
----------------------------------------------------------------------
1	180	724		1203		1827		0.00
2	360	1436		1190		1829		0.03
3	540	2091		1280		2150		0.06
4	720	2717		1214		2330		0.12

So we still have great near-linear scalability here. It looks like it could go all the way up to 8-10 nodes.

Latency

Nodes	Users	Throughput	Latency		Latency		Errors
		(req/min)	(ms, median)	(ms, average)	(%)
----------------------------------------------------------------------
1	180	724		1203		1827		0.00
2	180	817		223		234		0.01
3	180	809		191		191		0.03
4	180	809		180		177		0.02

This figure is no surprise after seeing how "inelastic" the load is.

Errors and the Mystery Of a Failed Login

Occasional errors, however rare and transient they are, really spoil an otherwise nearly perfect scalability picture. Two things clearly play a role here: the number of nodes and node saturation, both of which lead to high concurrency between local and slave updates to the database. After turning on the KeepAlive option in Apache, the error domain was reduced to just the 403 error, occurring on two operations:

  1. opening "My Account" link and
  2. logging out

performed by "browser" users.

The benchmark is configured so that these are the only operations requiring the user to be logged in. So the answer is that the user thread failed to log in. In this case no HTTP error is generated, just a "Sorry, unrecognized username or password." note on the front page. Since all of the benchmark's "browser" threads log in as the same user, my guess is that both logging in and logging out involve several database operations (read: they are not atomic). And since those operations are not transactional either, simultaneous logins as the same user on different nodes will understandably cause one of them to fail: the updates to the database get interleaved, and the one that updates the database last gets the login handle. Clearly this is just a benchmark limitation and will never happen in real life.
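
To illustrate the suspected interleaving, here is a toy model only (it assumes a non-transactional read-modify-write of a per-user session handle, which is not necessarily how Drupal actually stores sessions): two threads log in as the same user, the one that writes last keeps the handle, and the other finds its handle overwritten, i.e. it effectively fails to log in.

    import threading
    import time
    import uuid

    # Toy stand-in for a shared table mapping user -> current session handle.
    # This is NOT Drupal's real session schema; it only models the suspected
    # last-writer-wins interleaving of non-transactional login updates.
    sessions = {}

    def login(user, results, idx):
        my_handle = uuid.uuid4().hex
        time.sleep(0.01)                 # "read" phase: credential check etc.
        sessions[user] = my_handle       # "write" phase, no transaction around the two
        time.sleep(0.01)
        results[idx] = (sessions[user] == my_handle)   # do we still own the handle?

    results = [None, None]
    t1 = threading.Thread(target=login, args=("testuser", results, 0))
    t2 = threading.Thread(target=login, args=("testuser", results, 1))
    t1.start(); t2.start()
    t1.join(); t2.join()

    print(results)   # typically [True, False] or [False, True]: one login "loses"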

So are we perfect now?

