HAProxy – Experimental performance evaluation

HAProxy max throughput

HAProxy max throughput overview

When I was evaluating the influence of the number of fronts on the performance of WordPress (see the post Scaling-out WordPress – Performance Measures – Number of fronts influence), I wondered how many requests per second HAProxy was able to sustain.

Benchmarks on the HAProxy web site show that HAProxy can handle up to 25K req/sec with 8 KB objects over a 10 Gb NIC.
I decided to determine the limit on my virtualized test environment.

In this test, I use the standard Apache “It works!” page (177 bytes).

Configuration

HAProxy configuration

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 4096
        user haproxy
        group haproxy
        daemon
defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        retries 3
        option redispatch
        maxconn 2000
        contimeout      5000
        clitimeout      50000
        srvtimeout      50000

listen webcluster 192.168.100.209:80
        mode http
        stats enable
        stats auth admin:[PASSWORD]
        balance roundrobin
        option forwardfor
        server tu-web-01 192.168.100.210:80 check
        server tu-web-02 192.168.100.218:80 check
        server tu-web-03 192.168.100.172:80 check
        server tu-web-04 192.168.100.167:80 check

Testing environment

Servers:

ade-esxi-01: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz; No Hyperthreading; 8 GB DDR2 800; Raid card 3ware 9650SE+BBU; 4x1TB RAID10; WD Caviar Black 7200t/min (WD1001FALS)
ade-esxi-02: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz; No Hyperthreading; 8 GB DDR2 800; Raid card HP P400+BBU; 4x1TB RAID10; WD Caviar Black 7200t/min (WD1001FALS)
ade-esxi-03: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz; Hyperthreading active; 8 GB DDR3 1066; 4x1TB; WD Caviar Black 7200t/min (WD1001FALS)

Network:

1 Gb LAN
2 DGS-2208 switches
1 Gb NIC with MTU 1500 on the 3 servers

VM:

on ade-esxi-01
	tu-web-02: 4vcpu; 2GB; Ubuntu 12.04x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork
on ade-esxi-02
	tu-web-01: 4vcpu; 2GB; Ubuntu 12.04x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork	
on ade-esxi-03
	tu-lb-01: 1vcpu; 512MB; Ubuntu 12.04x64 server; HA-Proxy version 1.4.18 2011/09/16
	tu-web-03: 2vcpu; 2GB; Ubuntu 12.04x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork
	tu-web-04: 2vcpu; 2GB; Ubuntu 12.04x64 server; Server version: Apache/2.2.22 (Ubuntu); Server MPM: Prefork

Evaluating the Apache throughput limit on my testing environment

Test #22, 10 concurrent GET connections

ab -k -c 10 -n 1000000 http://tu-web-01/
Requests per second:    12570.77 [#/sec] (mean)

Number of apache2 child processes: 17
Iostat result (%cpu usage) : 51%
Free Mem: 1804MB

Test #23, 20 concurrent GET connections

ab -k -c 20 -n 1000000 http://tu-web-01/
Requests per second:    19419.05 [#/sec] (mean)

Number of apache2 child processes: 26
Iostat result (%cpu usage) : 96%
Free Mem: 1795MB

Test #24, 30 concurrent GET connections

ab -k -c 30 -n 1000000 http://tu-web-01/
Requests per second:    20272.52 [#/sec] (mean)

Number of apache2 child processes: 37
Iostat result (%cpu usage) : 98.75%

Test #25, 150 concurrent GET connections (the default maximum number of apache2 child processes)

ab -k -c 150 -n 1000000 http://tu-web-01/
Requests per second:    17088.43 [#/sec] (mean)

Number of apache2 child processes: 150
Iostat result (%cpu usage) : 100%
Free Mem: 1761MB
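
For reference, the 150-child ceiling seen in test #25 comes from Apache's prefork MPM defaults; on Ubuntu 12.04 they live in /etc/apache2/apache2.conf. The values below are the stock distribution defaults, not something measured here:

```
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>
```

Raising MaxClients would allow more children, but as the CPU is already saturated at 30 concurrent connections, it would not improve throughput here.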

Conclusion

“30 concurrent requests” gives the best throughput, approx 20 000 req/s (20272.52), and saturates the CPU (98.75%).
All tests are run with keep-alive connections (the -k flag of ab).
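
The concurrency sweep above can be scripted; here is a minimal sketch, assuming ab (from apache2-utils) is installed and tu-web-01 is reachable. The awk filter pulls the mean requests/sec line out of ab's report:

```shell
# Sweep concurrency levels and print the mean requests/sec for each.
# Assumes: ab installed (apache2-utils) and a reachable tu-web-01.
for c in 10 20 30 150; do
  rps=$(ab -k -c "$c" -n 100000 "http://tu-web-01/" 2>/dev/null \
        | awk '/^Requests per second/ {print $4}')
  echo "c=$c -> $rps req/s"
done
```

Dropping -n to 100000 per run keeps the sweep short while still giving a stable mean.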

Evaluating the HAProxy throughput limit on my testing environment

Test #26, 150 concurrent GET connections on the load balancer (≈37.5 per server)

ab -k -c 150 -n 1000000 http://tu-lb-01/
Requests per second:    29331.99 [#/sec] (mean)

On tu-lb-01:
Iostat result (%cpu usage) : 93% (%System = 88%)
Free Mem: 413MB

On tu-web-01:
Number of apache2 child processes: 61
Iostat result (%cpu usage) : 40%

Test #27, 600 concurrent GET connections on the load balancer, 150 per server

ab -k -c 600 -n 1000000 http://tu-lb-01/
Requests per second:    28130.55 [#/sec] (mean)

On tu-lb-01:
Iostat result (%cpu usage) : 95% (%System = 88%)
Free Mem: 402MB

On tu-web-01:
Number of apache2 child processes: 150
Iostat result (%cpu usage) : 44%

Conclusion

In this config, HAProxy delivers a throughput of approx 30 000 req/s (29331.99) with the CPU at 95%.
Memory usage on the LB stays under 200 MB.

The CPU of the first front stays around 40–50%.

It seems that the bottleneck on the load balancer is the CPU, so I will try increasing it.

Trying to increase the CPU available to HAProxy

Test #28, 600 concurrent GET connections on the load balancer; LB with 4 vCPUs (no modification to the haproxy config)

ab -k -c 600 -n 1000000 http://tu-lb-01/
Requests per second:    28542.19 [#/sec] (mean)

On tu-lb-01:
Iostat result (%cpu usage) : 22% (%System = 20.15%)
Free Mem: 414MB

On tu-web-01:
Number of apache2 child processes: 143
Iostat result (%cpu usage) : 42%

Note:
This result is expected: as the configuration has not changed, HAProxy still uses only one core, so 22% across 4 cores (1 busy, 3 idle) is roughly equivalent to 88% on 1 core.
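
A quick sanity check of that arithmetic:

```shell
# 88% of one core, averaged over 4 vCPUs (1 busy + 3 idle):
echo $(( 88 / 4 ))   # prints 22
```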

Test #29, 600 concurrent GET connections on the load balancer; LB with 4 vCPUs (haproxy config modified with nbproc 4)

ab -k -c 600 -n 1000000 http://tu-lb-01/
Requests per second:    28359.73 [#/sec] (mean)

On tu-lb-01:
Iostat result (%cpu usage) : 22% (%System = 26%)
Free Mem: 414MB

On tu-web-01:
Number of apache2 child processes: 150
Iostat result (%cpu usage) : 41%
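
For reference, the nbproc change used in this test is a one-line addition to the global section of the config shown earlier (a sketch; all other values as in the original config):

```
global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        nbproc 4
        maxconn 4096
        user haproxy
        group haproxy
        daemon
```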

Conclusion

The graph below shows the results:

[Graph: HAProxy max throughput]

The results are consistent with the benchmarks on the HAProxy web site (approx 35K req/sec with 177-byte objects over a 10 Gb NIC).

Increasing the number of CPUs has no effect on HAProxy throughput for this workload.
This can be explained by the fact that HAProxy is not multithreaded: for this workload, everything has to be handled by a single HAProxy process.
You can find an explanation of HAProxy's design choices here: http://haproxy.1wt.eu/

HAProxy implements an event-driven, single-process model which enables support for very high number of simultaneous connections at very high speeds. Multi-process or multi-threaded models can rarely cope with thousands of connections because of memory limits, system scheduler limits, and lock contention everywhere. Event-driven models do not have these problems because implementing all the tasks in user-space allows a finer resource and time management. The down side is that those programs generally don’t scale well on multi-processor systems. That’s the reason why they must be optimized to get the most work done from every CPU cycle.

There is a good explanation of what HAProxy has to manage here: http://1wt.eu/articles/2006_lb/index_07.html

In this config, the max throughput of HAProxy is approx 30K req/s, and a small VM (1 vCPU, 512 MB) is enough to reach this maximum performance.

In this config, as a static apache2 page tops out at 20K req/s, HAProxy improves the throughput by 50%.

Keep in mind that 20K req/s is huge: it corresponds to 72 million requests per hour.
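
Checking that conversion:

```shell
# 20 000 req/s sustained for one hour (3600 s):
echo $(( 20000 * 3600 ))   # prints 72000000
```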

