When I was evaluating the influence of the number of fronts on the performance of WordPress (see the post Scaling-out WordPress – Performance Measures – Number of fronts influence), I wondered how many requests per second HAProxy was able to sustain.
Some benchmarks on the HAProxy web site show that HAProxy is able to manage up to 25K req/s at 8 KB with a 10 Gb NIC.
I decided to try to determine the limit on my testing virtualized environment.
In this test, I use the standard Apache “It works” page (177 bytes).
Configuration
HAProxy configuration
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

listen webcluster 192.168.100.209:80
    mode http
    stats enable
    stats auth admin:[PASSWORD]
    balance roundrobin
    option forwardfor
    server tu-web-01 192.168.100.210:80 check
    server tu-web-02 192.168.100.218:80 check
    server tu-web-03 192.168.100.172:80 check
    server tu-web-04 192.168.100.167:80 check
Testing environment
Servers:
ade-esxi-01: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz; no Hyperthreading; 8 GB DDR2 800; RAID card 3ware 9650SE + BBU; 4x1TB RAID10; WD Caviar Black 7200 rpm (WD1001FALS)
ade-esxi-02: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz; no Hyperthreading; 8 GB DDR2 800; RAID card HP P400 + BBU; 4x1TB RAID10; WD Caviar Black 7200 rpm (WD1001FALS)
ade-esxi-03: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz; Hyperthreading active; 8 GB DDR3 1066; 4x1TB; WD Caviar Black 7200 rpm (WD1001FALS)
Network:
1 Gb LAN; 2 DGS-2208 switches
1 Gb NIC with MTU 1500 on the 3 servers
VM:
On ade-esxi-01:
tu-web-02: 4 vCPU; 2 GB; Ubuntu 12.04 x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork
On ade-esxi-02:
tu-web-01: 4 vCPU; 2 GB; Ubuntu 12.04 x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork
On ade-esxi-03:
tu-lb-01: 1 vCPU; 512 MB; Ubuntu 12.04 x64 server; HA-Proxy version 1.4.18 2011/09/16
tu-web-03: 2 vCPU; 2 GB; Ubuntu 12.04 x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork
tu-web-04: 2 vCPU; 2 GB; Ubuntu 12.04 x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork
Evaluate the limit of apache throughput on my testing environment
Test #22, 10 concurrent get connections
ab -k -c 10 -n 1000000 http://tu-web-01/
Requests per second: 12570.77 [#/sec] (mean)
Number of apache2 child process : 17
Iostat result (%cpu usage) : 51%
Free Mem: 1804MB
Test #23, 20 concurrent get connections
ab -k -c 20 -n 1000000 http://tu-web-01/
Requests per second: 19419.05 [#/sec] (mean)
Number of apache2 child process : 26
Iostat result (%cpu usage) : 96%
Free Mem: 1795MB
Test #24, 30 concurrent get connections
ab -k -c 30 -n 1000000 http://tu-web-01/
Requests per second: 20272.52 [#/sec] (mean)
Number of apache2 child process : 37
Iostat result (%cpu usage) : 98.75%
Test #25, 150 concurrent get connections (the default maximum number of Apache child processes)
ab -k -c 150 -n 1000000 http://tu-web-01/
Requests per second: 17088.43 [#/sec] (mean)
Number of apache2 child process : 150
Iostat result (%cpu usage) : 100%
Free Mem: 1761MB
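The sweep above can be wrapped in a small script. Here is a sketch of the step that extracts the mean requests-per-second figure from ab's report, shown on a sample report fragment; a full run would loop ab over the -c values 10, 20, 30 and 150 and pipe its output in.

```shell
#!/bin/sh
# Sketch: pull the mean "Requests per second" value out of an ab report.
# The here-string below stands in for a real ab run against a front.
report='Requests per second:    20272.52 [#/sec] (mean)
Time per request:       1.480 [ms] (mean)'

# ab prints the mean rate as the 4th field of the "Requests per second:" line.
rps=$(printf '%s\n' "$report" | awk '/^Requests per second:/ {print $4}')
echo "$rps"    # prints 20272.52
```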
Conclusion
The best throughput, approx 20,000 req/s (20272.52), is obtained with 30 concurrent requests; at that point the CPU is saturated (98.75%). It is the best test.
The tests are done with HTTP keep-alive connections (the -k option of ab).
Evaluate the limit of HAProxy throughput on my testing environment
Test #26, 150 concurrent get connections on the load balancer, i.e. an average of 37.5 connections per server
ab -k -c 150 -n 1000000 http://tu-lb-01/
Requests per second: 29331.99 [#/sec] (mean)
On tu-lb-01:
Iostat result (%cpu usage) : 93% (%System = 88%)
Free Mem: 413MB
On tu-web-01:
Number of apache2 child process : 61
Iostat result (%cpu usage) : 40%
Test #27, 600 concurrent get connections on the load balancer, 150 connections on each server
ab -k -c 600 -n 1000000 http://tu-lb-01/
Requests per second: 28130.55 [#/sec] (mean)
On tu-lb-01:
Iostat result (%cpu usage) : 95% (%System = 88%)
Free Mem: 402MB
On tu-web-01:
Number of apache2 child process : 150
Iostat result (%cpu usage) : 44%
Conclusion
In this config, HAProxy gives us a throughput of approx 30,000 req/s (29331.99); the CPU is at 95%.
Memory usage on the LB is less than 200 MB.
The CPU of the first front stays around 40%-50%.
The bottleneck on the load balancer seems to be the CPU, so I will try to increase the number of vCPUs.
Try to increase the CPU on HAProxy
Test #28, 600 concurrent get connections on the load balancer; LB with 4 vCPUs (no change in the HAProxy config)
ab -k -c 600 -n 1000000 http://tu-lb-01/
Requests per second: 28542.19 [#/sec] (mean)
On tu-lb-01:
Iostat result (%cpu usage) : 22% (%System = 20.15%)
Free Mem: 414MB
On tu-web-01:
Number of apache2 child process : 143
Iostat result (%cpu usage) : 42%
Rem:
This result is expected. Since the configuration has not changed, HAProxy still uses only one core, so 22% across 4 cores (1 used, 3 idle) is roughly equivalent to 88% on 1 core.
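The equivalence can be checked with a quick calculation: iostat reports utilization averaged over all cores, so one saturated core out of four shows up as the per-core load divided by four.

```shell
#!/bin/sh
# One busy core out of four: the overall utilization reported by iostat
# is the single-core load divided by the number of cores.
cores=4
single_core_pct=88               # %System observed with 1 vCPU
overall_pct=$((single_core_pct / cores))
echo "$overall_pct"              # prints 22, matching the ~22% seen with 4 vCPUs
```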
Test #29, 600 concurrent get connections on the load balancer; LB with 4 vCPUs (HAProxy config changed: nbproc 4)
ab -k -c 600 -n 1000000 http://tu-lb-01/
Requests per second: 28359.73 [#/sec] (mean)
On tu-lb-01:
Iostat result (%cpu usage) : 22% (%System = 26%)
Free Mem: 414MB
On tu-web-01:
Number of apache2 child process : 150
Iostat result (%cpu usage) : 41%
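For reference, the nbproc change used in Test #29 is a one-line addition to the global section of the HAProxy configuration. This is a sketch of just the relevant fragment, not the full config used in the tests:

```
global
    log 127.0.0.1 local0
    maxconn 4096
    daemon
    nbproc 4    # start 4 HAProxy processes, one per vCPU
```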
Conclusion
The graph below shows the results:
The results are consistent with the benchmarks on the HAProxy web site (approx 35K req/s with 177-byte objects on a 10 Gb NIC).
Increasing the number of CPUs has no effect on the throughput of HAProxy for this workload.
This can be explained by the fact that the HAProxy process is not multithreaded: for this workload, everything has to be handled by one HAProxy process.
You can find an explanation of HAProxy's design choices here: http://haproxy.1wt.eu/
HAProxy implements an event-driven, single-process model which enables support for very high number of simultaneous connections at very high speeds. Multi-process or multi-threaded models can rarely cope with thousands of connections because of memory limits, system scheduler limits, and lock contention everywhere. Event-driven models do not have these problems because implementing all the tasks in user-space allows a finer resource and time management. The down side is that those programs generally don’t scale well on multi-processor systems. That’s the reason why they must be optimized to get the most work done from every CPU cycle.
There is a good explanation of what HAProxy has to manage here: http://1wt.eu/articles/2006_lb/index_07.html
In this config, the maximum throughput of HAProxy is approx 30K req/s, and a small VM (1 vCPU, 512 MB) is enough to reach it.
In this config, since a static Apache page tops out at 20K req/s, HAProxy improves the throughput by 50%.
Keep in mind that 20K req/s is huge: it corresponds to 72 million requests per hour.
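The conversion is simple shell arithmetic:

```shell
#!/bin/sh
# 20,000 requests/second expressed per hour (3600 seconds)
echo $((20000 * 3600))    # prints 72000000, i.e. 72 million requests/hour
```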