Some have wondered “how fast” this implementation is. We have good news.
In a typical ‘SOHO gateway’ application, our SG-5100 appliance (4-core Atom C3558) running the WireGuard implementation in pfSense 2.5 achieves 909Mbps in an iperf3 laboratory test with the MSS set to 1380. This is a single stream, with the firewall (‘pf’) enabled, and it is 99.89% of what is theoretically possible.
Meanwhile, the “wireguard-go” port, running on FreeBSD 13-CURRENT on the same machine, achieves only 329Mbps in the same setup, a mere 36.15% of the theoretical maximum.
Our XG-1541 appliance, with an 8-core Xeon D-1541, achieves 1542Mbps using the kernel WireGuard implementation, while wireguard-go, again running on FreeBSD 13, achieves 1370Mbps.
Additional streams (using iperf3’s ‘-P’ switch) don’t add much performance, because there is no way to spread the streams across cores: the TCP ports sit too deep in the packet, inside the encrypted WireGuard payload, for RSS to see them. As a result, throughput is limited by single-core performance. For example, on the XG-1541:
- Single Stream: 1.52 Gbits/sec
- Two Stream: 1.53 Gbits/sec
- Four Stream: 1.52 Gbits/sec
- Eight Stream: 1.53 Gbits/sec
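To illustrate why extra streams land on one core, here is a minimal sketch of queue selection by flow hash. It is not the NIC's real algorithm: hardware RSS typically uses a Toeplitz hash, and `zlib.crc32` merely stands in for it here. The queue count, addresses, and ports are all hypothetical.

```python
import zlib

NUM_QUEUES = 8  # assumption: one receive queue per core


def rss_queue(src_ip, dst_ip, src_port, dst_port, queues=NUM_QUEUES):
    """Pick a receive queue from the visible 4-tuple (CRC32 stands in for Toeplitz)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % queues


# Eight iperf3 streams as they would look unencrypted: distinct source ports.
inner_flows = [("10.0.0.2", "10.0.1.2", 50000 + i, 5201) for i in range(8)]

# Inside one tunnel, every stream is wrapped in the same outer UDP 4-tuple.
outer_tuple = ("192.0.2.1", "192.0.2.2", 51820, 51820)

plain_queues = {rss_queue(*flow) for flow in inner_flows}
tunnel_queues = {rss_queue(*outer_tuple) for _ in inner_flows}

print("queues used without tunnel:", len(plain_queues))
print("queues used inside one tunnel:", len(tunnel_queues))
```

However many inner streams there are, the hash input for tunneled traffic never changes, so every packet is steered to the same queue.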
In order to test the performance of multiple cores, and to avoid the 1Gbps NIC limit, we also tested a better “site-to-site” configuration, consisting of four WireGuard tunnels, each with a routed /18 (four equal parts of a /16), and using TRex. Here the packets sent are 1420 octets each, including L2 headers, and we tested bi-directional throughput, as one can clearly imagine scenarios where the tunnels are ‘full’ in both directions in a true site-to-site application.
- For the SG-5100, using the WireGuard implementation in pfSense 2.5: 1846Mbps
- For the SG-5100, using wireguard-go on FreeBSD-13: 661Mbps
- For the XG-1541, using the WireGuard implementation in pfSense 2.5: 6510Mbps
- For the XG-1541, using wireguard-go on FreeBSD-13: 2700Mbps
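The four-way split of a /16 used for the quad-tunnel test can be sketched with Python's standard `ipaddress` module. The `10.0.0.0/16` prefix and the `wg0`..`wg3` interface names are assumptions for illustration; the actual test addressing is not given here.

```python
import ipaddress

# Hypothetical site prefix; the post does not state the real one.
site = ipaddress.ip_network("10.0.0.0/16")

# Four equal /18 networks, one routed through each WireGuard tunnel.
routes = list(site.subnets(new_prefix=18))

for index, network in enumerate(routes):
    print(f"wg{index}: {network} ({network.num_addresses} addresses)")
```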
Much has been made of the performance of WireGuard, including some well-meaning but flawed benchmarks by the WireGuard project. Without compression (which WireGuard does not offer) there is just no way to pass “1011Mbps” on a 1Gbps NIC. Since the Intel 82579 and I218 NICs used in that benchmarking are both 1Gbps NICs, incapable of passing more than 1000Mbps of traffic, the result is impossible, and this is all before any consideration of the framing overhead of both Ethernet and WireGuard.
In this post, Jason Donenfeld, the primary author of WireGuard, supplies a simple version of the WireGuard protocol overhead:
- 20-byte IPv4 header or 40 byte IPv6 header
- 8-byte UDP header
- 4-byte type
- 4-byte key index
- 8-byte nonce
- N-byte encrypted data
- 16-byte authentication tag
So, if you assume 1500 byte Ethernet frames, the worst case (IPv6) winds up being 1500-(40+8+4+4+8+16), leaving N=1420 bytes. However, if you know ahead of time that you’re going to be using IPv4 exclusively, then you could get away with N=1440 bytes.
While “bytes” are commonly understood as 8 bit quantities, this is not necessarily so. Because of this, the networking community uses the term “octets” to describe 8-bit quantities, and I will do so in the rest of this post.
In addition to these 60 or 80 octets of overhead due to WireGuard’s framing, there is also an enclosed IP header (20 octets for IPv4, 40 octets for IPv6), and if you are using iperf3, there is also a TCP header, for an additional 20 octets. Put it all together, and in this test you can only pass either 1380 octets of payload or 1400 octets, depending on whether you are willing to test only IPv4 or want to include IPv6.
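The arithmetic above can be checked with a short sketch. The header sizes come from the list and text above; the function and variable names are made up for this example, and TCP options are assumed absent (a bare 20-octet TCP header).

```python
# Header sizes in octets, taken from the overhead list above.
IP_HEADER = {4: 20, 6: 40}
WG_FIELDS = {"udp": 8, "type": 4, "key_index": 4, "nonce": 8, "auth_tag": 16}
TCP_HEADER = 20  # assumption: no TCP options


def wireguard_room(mtu, outer_ip):
    """Octets left for the encrypted data (N) after the outer headers."""
    return mtu - IP_HEADER[outer_ip] - sum(WG_FIELDS.values())


def tcp_payload(mtu, outer_ip, inner_ip):
    """Octets of TCP payload once the inner IP and TCP headers are removed."""
    return wireguard_room(mtu, outer_ip) - IP_HEADER[inner_ip] - TCP_HEADER


print("N, IPv6 outer:", wireguard_room(1500, 6))            # 1420
print("N, IPv4 outer:", wireguard_room(1500, 4))            # 1440
print("payload, IPv4 throughout:", tcp_payload(1500, 4, 4)) # 1400
print("payload, N=1420 with inner IPv4:", tcp_payload(1500, 6, 4))  # 1380
```

Under these assumptions, 1400 octets is the IPv4-only case, and 1380 is what remains when the worst-case N=1420 is kept while the inner packet is IPv4.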
We’re still not done with overheads though. We still need to account for Ethernet framing at both Layer 1 (PHY) and Layer 2 (MAC). Without considering 802.1q tags, the fields of an Ethernet frame include:
- Preamble (7 octets)
- Start of Frame Delimiter (1 octet)
- Destination (MAC) address (6 octets)
- Source (MAC) address (6 octets)
- Type (2 octets)
- Data, or payload. 46 - 1500 octets
- Frame Check Sequence, aka CRC (4 octets)
- Inter-Frame Gap (a silent period between frames equal to 12 octets)
For a 1500 octet payload, the total packet “on the wire” is 1538 octets. As above, the most we can send (assuming IPv4) using iperf3 is 1400 octets of actual data, with the remaining 138 octets consumed by framing and headers. As a direct consequence, on a 1Gbps link, WireGuard can’t transmit faster than 910Mbps, and this assumes no retransmits or other packet loss.
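As a rough check of that ceiling, the sketch below adds up the Ethernet fields listed above and scales the link rate by the share of each frame that is actual data. The names are illustrative only, and the 6-octet MAC address sizes are the standard Ethernet values.

```python
ETH_PAYLOAD = 1500  # maximum Ethernet payload in octets

# Layer 1 and Layer 2 framing around the payload, in octets.
ETH_FRAMING = {
    "preamble": 7,
    "start_of_frame_delimiter": 1,
    "destination_mac": 6,
    "source_mac": 6,
    "type": 2,
    "frame_check_sequence": 4,
    "inter_frame_gap": 12,
}

WIREGUARD_IPV4 = 20 + 8 + 4 + 4 + 8 + 16  # outer IPv4, UDP, WireGuard fields, tag
INNER_HEADERS = 20 + 20                   # inner IPv4 header plus TCP header

wire_octets = ETH_PAYLOAD + sum(ETH_FRAMING.values())
data_octets = ETH_PAYLOAD - WIREGUARD_IPV4 - INNER_HEADERS


def ceiling_mbps(link_mbps):
    """Best-case data rate: link rate times the data fraction of each frame."""
    return link_mbps * data_octets / wire_octets


print("octets on the wire:", wire_octets)                 # 1538
print("octets of data:", data_octets)                     # 1400
print("ceiling on 1Gbps:", round(ceiling_mbps(1000), 1))  # 910.3
```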
To compare WireGuard with IPsec, here is a sample from a recent test of IPsec using AES-GCM-128 across a range of packet sizes in the same quad-tunnel test harness, using the exact same boxes (here, the SG-5100 with its 4-core Atom). To keep things fair, this test is software-only: we didn’t enable QAT, just AES-NI.
At a 1518 octet L2 packet size, throughput is 1723.6Mbps vs WireGuard at a 1420 octet L2 packet size yielding 1846Mbps. WireGuard does indeed edge out IPsec here, but not by much. Some of this is due to overheads in FreeBSD’s OpenCrypto framework. We are addressing these, and this will be the subject of a future blog post. An additional difference here is that due to the differences in frame size, some fragmentation is undoubtedly occurring during the IPsec test. Also note the fun ‘knee’ here at 1024 octet frames.
So there you have it: 99.89% of the theoretical maximum throughput on a 1Gbps link using a single core of a C3000 Atom CPU at 2.2GHz, and more with multiple tunnels and larger CPUs. This validates our decision to invest in bringing kernel WireGuard to pfSense and FreeBSD. The world’s most popular firewall and the world’s greatest operating system can now both enjoy top-tier status where VPN performance is concerned.