This is the second part in the series of posts about scaling the Drupal stack. The first part can be found here.
GLB has been fixed to support unlimited connections, so now I can benchmark the Drupal stack cluster on large EC2 instances. What I'm mainly looking for here is how much Galera synchronization overhead affects performance. With small instances everything was pretty clear: a single core hindered by Xen was an easy target, and scalability was predictably linear. A large instance is dual-core and Xen interference is minimal, so the effect of Galera synchronization and serialization must be more pronounced. How desperately bad is it?
We'll start by looking at HTTP load "elasticity" on a large EC2 instance:
Users   Throughput   Latency        Errors
        (req/min)    (ms, median)   (%)
-----------------------------------------------
  20        92          171          0.00
  40       182          182          0.00
  60       270          201          0.00
  80       358          214          0.00
 100       448          240          0.00
 120       534          289          0.00
 140       623          371          0.00
 160       677          562          0.00
 180       716         1250          0.00
 200       706         2966          0.00
 220       694         4730          0.00
 240       682         6270          0.00
 260       673         7256          0.00
 280       688         8896          0.00
 300       701         9409          0.00
 320       680        11100          0.00
It looks even less elastic than on a small instance. Just a 12% increase in concurrent users turned the server from moderately loaded to saturated and bumped latencies up 2.5 times. (For a small instance it was 33%.) It is surprising how closely it follows theoretical predictions.
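For context, here is a minimal sketch of the textbook M/M/1 queueing relation R = S / (1 - rho), assuming that is roughly the kind of theoretical prediction meant here; the service time value is illustrative, not measured:

```python
# Minimal sketch: M/M/1 response time R = S / (1 - rho), where S is the
# service time and rho the server utilization. Numbers are illustrative,
# not taken from the benchmark.
service_time_ms = 170.0  # roughly the latency of an unloaded server

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    response_ms = service_time_ms / (1.0 - utilization)
    print(f"rho={utilization:.2f}  latency={response_ms:7.0f} ms")

# Going from 80% to 90% utilization (a ~12% load increase) already doubles
# the latency, which is the same kind of jump as in the table above.
```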
While we are at it, it would be curious to see the same effect on a 4-node Drupal cluster. For a saturation point I took 4x180 = 720 concurrent users and looked at how the cluster behaved when departing from that value:
Users   Throughput   Latency        Latency         Errors
        (req/min)    (ms, median)   (ms, average)   (%)
--------------------------------------------------------------
  20        90          151            149           0.00
 120       530          180            168           0.00
 220       967          179            174           0.00
 320      1407          202            202           0.02
 420      1845          223            233           0.03
 520      2308          276            299           0.03
 620      2690          411            545           0.08
 720      2717         1214           2330           0.12
 820      2550         1946           6502           0.04
 920      2480         2007            n/a           0.07
1020      2360         1878            n/a           0.12
1120      2300         1955          11200           0.13
The scatter in the error curve is due to the very small number of errors (one in several thousand), so it is purely statistical. Two interesting things here:
- It is even worse than a single server. No comment on that.
- Median latencies start to differ significantly from the average ones and no longer grow linearly with the number of users. This means that the performance degradation comes from a few requests with extremely high latencies, while most other latencies stay around 2 seconds (the sketch after this list illustrates the effect). To the end user this probably looks like the site getting really stuck from time to time and then returning to normal.
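To illustrate with made-up numbers (not benchmark data) how a handful of stuck requests drag the average far above the median:

```python
import statistics

# Hypothetical latency sample: most requests around 2 s, a few extreme stragglers.
latencies_ms = [2000] * 95 + [60000] * 5  # 95 "normal" requests, 5 stuck ones

print("median :", statistics.median(latencies_ms), "ms")  # stays at 2000 ms
print("average:", statistics.mean(latencies_ms), "ms")    # dragged up by the tail
```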
Be that as it may, the cluster benchmarking again has to be split in two:
- Throughput measurement at the saturation point, which varies proportionally to the number of nodes.
- Latency measurement at the single-node saturation point of 180 concurrent users (a small sketch of this split follows the list).
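In code form, the measurement plan looks roughly like this (hypothetical helper, not the actual benchmark driver):

```python
# Sketch of the two measurement modes described above.
SINGLE_NODE_SATURATION = 180  # concurrent users, from the single-node table

def plan_runs(nodes):
    return [
        # throughput run: load grows proportionally to the cluster size
        {"name": "throughput", "users": nodes * SINGLE_NODE_SATURATION},
        # latency run: fixed single-node-saturation load spread over the cluster
        {"name": "latency", "users": SINGLE_NODE_SATURATION},
    ]

for n in range(1, 5):
    print(n, plan_runs(n))
```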
Throughput
Nodes   Users   Throughput   Latency        Latency         Errors
                (req/min)    (ms, median)   (ms, average)   (%)
----------------------------------------------------------------------
  1      180        724         1203           1827          0.00
  2      360       1436         1190           1829          0.03
  3      540       2091         1280           2150          0.06
  4      720       2717         1214           2330          0.12
So we still have great near-linear scalability here. It looks like it could go all the way up to 8-10 nodes.
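A quick sanity check of that claim, computed from the throughput column of the table above:

```python
# Per-node scaling efficiency, derived from the throughput table above.
throughput = {1: 724, 2: 1436, 3: 2091, 4: 2717}  # req/min at nodes * 180 users

base = throughput[1]
for nodes, total in throughput.items():
    efficiency = total / (nodes * base)
    print(f"{nodes} node(s): {total:5d} req/min, {efficiency:.0%} of ideal linear scaling")
```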
Latency
Nodes   Users   Throughput   Latency        Latency         Errors
                (req/min)    (ms, median)   (ms, average)   (%)
----------------------------------------------------------------------
  1      180        724         1203           1827          0.00
  2      180        817          223            234          0.01
  3      180        809          191            191          0.03
  4      180        809          180            177          0.02
This figure is no surprise after seeing how "inelastic" the load is.
Errors and the Mystery of a Failed Login
Occasional errors, however rare and transient, really spoil an otherwise nearly perfect scalability picture. Two things clearly play a role here: the number of nodes and node saturation, both of which lead to high concurrency between local and slave updates to the database. After turning on the KeepAlive option in Apache, the error domain was reduced to a single 403 error and two operations:
- opening "My Account" link and
- logging out
performed by "browser" users.
The benchmark is configured so that these are the only operations that require being logged in. So the answer is that the user thread failed to log in. In this case no HTTP error is generated, just a "Sorry, unrecognized username or password." note on the front page. Since all of the benchmark's "browser" threads log in as the same user, my guess is that both logging in and logging out involve several database operations (read: they are not atomic). Because they are not transactional either, simultaneously logging in as the same user on different nodes will understandably cause one of those logins to fail: the updates to the database get interleaved, and whichever session updates the database last gets the login handle. Clearly this is just a benchmark limitation and will never happen in real life.
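To make the suspected race concrete, here is a toy sketch (plain Python, not Drupal's actual login code) of two non-atomic, non-transactional logins for the same user racing on different nodes, where the last writer wins:

```python
import threading

# Toy model of the suspected race: login is a non-atomic read-then-write of a
# shared per-user session record, and two nodes perform it concurrently.
session = {}             # stands in for the replicated user/session table
print_lock = threading.Lock()  # only keeps print output readable

def login(node, user):
    current = session.get(user)             # step 1: read the existing session row
    # ... more non-transactional work happens here on the real stack ...
    session[user] = f"session-from-{node}"  # step 2: write a new session row
    with print_lock:
        print(f"{node}: saw {current!r}, wrote {session[user]!r}")

threads = [threading.Thread(target=login, args=(f"node{i}", "browser-user"))
           for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whichever node writes last "owns" the login; the other thread's session is
# silently overwritten, which the benchmark sees as a failed login.
print("final session:", session["browser-user"])
```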
So are we perfect now?