This is the second part in the series of posts about scaling the Drupal stack. The first part can be found here.
GLB has been fixed to support unlimited connections, so now I can benchmark the Drupal stack cluster on large EC2 instances. What I'm mainly looking for here is how much Galera synchronization overhead affects performance. With small instances everything was pretty clear: a single core hindered by Xen was an easy target, and scalability was predictably linear. A large instance is dual-core and Xen interference is minimal, so the effect of Galera synchronization and serialization must be more pronounced. How desperately bad is it?
We'll start by looking at HTTP load "elasticity" on a large EC2 instance:
Users   Throughput   Latency        Errors
        (req/min)    (ms, median)   (%)
-----------------------------------------------
  20        92          171          0.00
  40       182          182          0.00
  60       270          201          0.00
  80       358          214          0.00
 100       448          240          0.00
 120       534          289          0.00
 140       623          371          0.00
 160       677          562          0.00
 180       716         1250          0.00
 200       706         2966          0.00
 220       694         4730          0.00
 240       682         6270          0.00
 260       673         7256          0.00
 280       688         8896          0.00
 300       701         9409          0.00
 320       680        11100          0.00
It looks even less elastic than on a small instance. Just a 12% increase in concurrent users turned the server from moderately loaded to saturated and bumped latencies up 2.5 times. (For a small instance it was 33%.) It is surprising how closely it follows theoretical predictions.
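For context, here is a minimal sketch of the textbook M/M/1 queueing relation R = S / (1 - rho), assuming that is roughly the kind of theoretical prediction meant here; the service time value is illustrative, not measured:

```python
# Minimal sketch: M/M/1 response time R = S / (1 - rho), where S is the
# service time and rho the server utilization. Numbers are illustrative,
# not taken from the benchmark.
service_time_ms = 170.0  # roughly the latency of an unloaded server

for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    response_ms = service_time_ms / (1.0 - utilization)
    print(f"rho={utilization:.2f}  latency={response_ms:7.0f} ms")

# Going from 80% to 90% utilization (a ~12% load increase) already doubles
# the latency, which is the same kind of jump as in the table above.
```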
While we are at it, it would be curious to see the same effect on a 4-node Drupal cluster. For a saturation point I took 4x180 = 720 concurrent users and looked at how the cluster behaved when departing from that value:
Users   Throughput   Latency        Latency         Errors
        (req/min)    (ms, median)   (ms, average)   (%)
--------------------------------------------------------------
  20        90          151            149           0.00
 120       530          180            168           0.00
 220       967          179            174           0.00
 320      1407          202            202           0.02
 420      1845          223            233           0.03
 520      2308          276            299           0.03
 620      2690          411            545           0.08
 720      2717         1214           2330           0.12
 820      2550         1946           6502           0.04
 920      2480         2007            n/a           0.07
1020      2360         1878            n/a           0.12
1120      2300         1955          11200           0.13
The scatter in the error curve is due to the very small number of errors (one in several thousand), so it is purely statistical. Two interesting things here:
- It is even worse than a single server. No comment on that.
- Median latencies start to differ significantly from the average ones and no longer grow linearly with the number of users. This means that the performance degradation comes from a few requests with extremely high latencies, while most other latencies stay around 2 seconds (the sketch after this list illustrates the effect). To the end user this probably looks like the site getting really stuck from time to time and then returning to normal.
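To illustrate with made-up numbers (not benchmark data) how a handful of stuck requests drag the average far above the median:

```python
import statistics

# Hypothetical latency sample: most requests around 2 s, a few extreme stragglers.
latencies_ms = [2000] * 95 + [60000] * 5  # 95 "normal" requests, 5 stuck ones

print("median :", statistics.median(latencies_ms), "ms")  # stays at 2000 ms
print("average:", statistics.mean(latencies_ms), "ms")    # dragged up by the tail
```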
Be that as it may, the cluster benchmarking again has to be split in two:
- Throughput measurement at the saturation point, which varies proportionally to the number of nodes.
- Latency measurement at the single-node saturation point of 180 concurrent users (a small sketch of this split follows the list).
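In code form, the measurement plan looks roughly like this (hypothetical helper, not the actual benchmark driver):

```python
# Sketch of the two measurement modes described above.
SINGLE_NODE_SATURATION = 180  # concurrent users, from the single-node table

def plan_runs(nodes):
    return [
        # throughput run: load grows proportionally to the cluster size
        {"name": "throughput", "users": nodes * SINGLE_NODE_SATURATION},
        # latency run: fixed single-node-saturation load spread over the cluster
        {"name": "latency", "users": SINGLE_NODE_SATURATION},
    ]

for n in range(1, 5):
    print(n, plan_runs(n))
```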
Throughput
Nodes   Users   Throughput   Latency        Latency         Errors
                (req/min)    (ms, median)   (ms, average)   (%)
----------------------------------------------------------------------
  1      180        724         1203           1827          0.00
  2      360       1436         1190           1829          0.03
  3      540       2091         1280           2150          0.06
  4      720       2717         1214           2330          0.12
So we still have great near-linear scalability here. It looks like it could go all the way up to 8-10 nodes.
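A quick sanity check of that claim, computed from the throughput column of the table above:

```python
# Per-node scaling efficiency, derived from the throughput table above.
throughput = {1: 724, 2: 1436, 3: 2091, 4: 2717}  # req/min at nodes * 180 users

base = throughput[1]
for nodes, total in throughput.items():
    efficiency = total / (nodes * base)
    print(f"{nodes} node(s): {total:5d} req/min, {efficiency:.0%} of ideal linear scaling")
```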
Latency
Nodes   Users   Throughput   Latency        Latency         Errors
                (req/min)    (ms, median)   (ms, average)   (%)
----------------------------------------------------------------------
  1      180        724         1203           1827          0.00
  2      180        817          223            234          0.01
  3      180        809          191            191          0.03
  4      180        809          180            177          0.02
This figure is no surprise after seeing how "inelastic" the load is.
Errors and the Mystery of a Failed Login
Occasional errors, however rare and transient, really spoil an otherwise nearly perfect scalability picture. Two things clearly play a role here: the number of nodes and node saturation, both of which lead to high concurrency between local and slave updates to the database. After turning on the KeepAlive option in Apache, the error domain was reduced to a single 403 error and two operations:
- opening "My Account" link and
- logging out
performed by "browser" users.
The benchmark is configured so that these are the only operations that require being logged in. So the answer is that the user thread failed to log in. In this case no HTTP error is generated, just a "Sorry, unrecognized username or password." note on the front page. Since all of the benchmark's "browser" threads log in as the same user, my guess is that both logging in and logging out involve several database operations (read: they are not atomic). Because they are not transactional either, simultaneously logging in as the same user on different nodes will understandably cause one of those logins to fail: the updates to the database get interleaved, and whichever session updates the database last gets the login handle. Clearly this is just a benchmark limitation and will never happen in real life.
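To make the suspected race concrete, here is a toy sketch (plain Python, not Drupal's actual login code) of two non-atomic, non-transactional logins for the same user racing on different nodes, where the last writer wins:

```python
import threading

# Toy model of the suspected race: login is a non-atomic read-then-write of a
# shared per-user session record, and two nodes perform it concurrently.
session = {}             # stands in for the replicated user/session table
print_lock = threading.Lock()  # only keeps print output readable

def login(node, user):
    current = session.get(user)             # step 1: read the existing session row
    # ... more non-transactional work happens here on the real stack ...
    session[user] = f"session-from-{node}"  # step 2: write a new session row
    with print_lock:
        print(f"{node}: saw {current!r}, wrote {session[user]!r}")

threads = [threading.Thread(target=login, args=(f"node{i}", "browser-user"))
           for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Whichever node writes last "owns" the login; the other thread's session is
# silently overwritten, which the benchmark sees as a failed login.
print("final session:", session["browser-user"])
```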
So are we perfect now?