Demystifying High Availability In pfSense Software

What is High Availability?

High Availability (HA) is an important concept in Systems Engineering that eliminates single points of failure, ensuring continuous operation despite many kinds of hardware or software failure. Specific to networking, it allows a network engineer or operator to replace or repair a failed device or component without affecting services and end users. Conceptually, this is similar to RAID 1 for data storage, where a pair of drives is mirrored to hold redundant copies of the data. Even if one drive fails, the array as a whole does not lose data and remains accessible.

There is a phrase often used in IT: “Two is one and one is none”. Essentially, if you have two of something, you should assume you only have one of it. This is because hardware can fail at any time, for any reason, and often when you least expect it (see the Bathtub Curve used in reliability engineering). If you have a single point of failure, you could be moments away from a non-functional network.


When considering whether you should have a High Availability setup, it’s always important to weigh the risks and what is acceptable relative to your organization's stated goals. If a particular function of a business is considered critical to its profitability but has no redundancy in place, a good IT leader or engineer will want to remedy this as soon as possible and, equally important, test it frequently to ensure it works as intended.

CARP - The “Beating Heart” of HA on pfSense® software

CARP, the Common Address Redundancy Protocol, is a protocol similar to VRRP that allows a set of hosts on a local area network to “share” an IP address. The hosts monitor which of them currently has “control” of the address, and the protocol ensures that one of the available devices is in control of it at all times.

In a typical CARP configuration, a pair of firewalls consumes three IP addresses on each CARP-enabled interface: one for the interface itself on each node, plus a third shared IP address for the CARP Virtual IP address (VIP). This requirement must be met on every interface carrying user traffic, including WANs.

So how does it work? As an example, we’ll assume the following:

The switch(es) connected to both nodes on interfaces utilizing CARP must be capable of properly handling multicast traffic. Some low-end switches, such as those built into CPE devices, do not handle this traffic well. Additionally, some virtual switches in hypervisors require special configuration to allow not only the multicast traffic but the MAC address changes necessary for failover to function properly.
The WAN IP addresses are provided from upstream and must be static with at least a /29 to provide enough usable addresses for CARP.
There is only one WAN and one LAN interface being utilized on both appliances with the LAN interface utilizing a 192.168.1.0/24 address space and the WAN interface utilizing a 198.51.100.0/24 address space across the pair.
As an example of IP addressing for this setup:
1. Firewall 1 is the primary node and is using a LAN IP address of 192.168.1.2 and a WAN IP address of 198.51.100.201/24
2. Firewall 2 is the secondary node and is using a LAN IP address of 192.168.1.3 and a WAN IP address of 198.51.100.202/24
3. Both firewalls share a CARP IP address of 192.168.1.1 for LAN and the CARP VIP is used as the gateway for all clients. The shared CARP IP address for WAN is 198.51.100.200/24.
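As a rough sanity check of the addressing above, the short Python sketch below confirms that each interface has three distinct addresses inside its subnet, and shows why a /29 is the smallest common WAN allocation that leaves room for CARP. The helper names `check_plan` and `usable_hosts` are hypothetical and exist only for this illustration; they are not part of pfSense software.

```python
import ipaddress

# Addressing plan copied from the example above.
PLAN = {
    "LAN": {
        "network": "192.168.1.0/24",
        "primary": "192.168.1.2",
        "secondary": "192.168.1.3",
        "vip": "192.168.1.1",
    },
    "WAN": {
        "network": "198.51.100.0/24",
        "primary": "198.51.100.201",
        "secondary": "198.51.100.202",
        "vip": "198.51.100.200",
    },
}


def check_plan(plan):
    """Return a list of problems found in the plan (empty if none)."""
    problems = []
    for name, iface in plan.items():
        network = ipaddress.ip_network(iface["network"])
        addrs = [ipaddress.ip_address(iface[role])
                 for role in ("primary", "secondary", "vip")]
        if len(set(addrs)) != 3:
            problems.append(f"{name}: primary, secondary, and VIP must be distinct")
        for addr in addrs:
            if addr not in network:
                problems.append(f"{name}: {addr} is outside {network}")
    return problems


def usable_hosts(prefix_len):
    """Usable IPv4 host addresses in a subnet (meaningful for /30 and shorter)."""
    network = ipaddress.ip_network(f"192.0.2.0/{prefix_len}", strict=False)
    return network.num_addresses - 2  # minus network and broadcast addresses


print(check_plan(PLAN))                    # []
print(usable_hosts(30), usable_hosts(29))  # 2 6
```

A /30 offers only two usable addresses, which cannot hold two interface addresses, a VIP, and the upstream gateway; a /29 offers six.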

Once configured, Firewall 1 will send a heartbeat on every CARP-configured interface at a predetermined base frequency of once per second, letting Firewall 2 know that it’s still “alive” and processing traffic. By default, pfSense software will configure a skew of 100 for Firewall 2, which means that Firewall 2 will assume everything is fine as long as the gap between the heartbeats it receives is no more than the base + skew (1 second + 100/256ths of a second, or roughly 1.39 seconds). While this covers the basics of how the mechanism works, there is a bit more complexity here. See the High Availability documentation for additional details on the specifics of how the advertisements, base, and skew values operate.
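The base + skew arithmetic can be worked through in a few lines of Python. This is only a sketch of the figure quoted above, not the full CARP timer logic, and the function name `base_plus_skew` is made up for this example.

```python
from fractions import Fraction


def base_plus_skew(base_seconds: int, skew: int) -> Fraction:
    """Return base + skew/256 in seconds, as an exact fraction."""
    return Fraction(base_seconds) + Fraction(skew, 256)


primary = base_plus_skew(1, 0)      # Firewall 1: base 1, skew 0 (assumed)
secondary = base_plus_skew(1, 100)  # Firewall 2: base 1, skew 100

print(float(primary))    # 1.0
print(float(secondary))  # 1.390625
```

The skew of 0 for Firewall 1 is an assumption made for the illustration; the text above only states the skew for Firewall 2.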

If, at any point, Firewall 2 stops receiving heartbeats, known as advertisements, from Firewall 1 within the time allowed by the base and skew values, it will assume Firewall 1 is no longer processing traffic and take over the PRIMARY role. This will “swing” control of the CARP IP address over to Firewall 2, which will immediately begin processing traffic. Firewall 2 will then begin transmitting heartbeats on this interface and act in Firewall 1’s place until it begins receiving heartbeats from Firewall 1 again. The default CARP behavior is to perform preemption: as soon as Firewall 1 comes back online, Firewall 2 will see heartbeats arriving more frequently than its own base and skew values allow it to send, and recognize that Firewall 1 is ready to take over the CARP IP address again. Firewall 2 will then change its status to SECONDARY and be ready to take over should Firewall 1 stop sending heartbeats again.

All of this happens within seconds to minimize loss of network traffic.
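The failover and preemption sequence described in this section can be modelled with a deliberately simplified Python sketch. It assumes that, among the nodes still alive, the one with the smallest base + skew value holds the VIP; it does not model real timers, packets, or the CARP implementation itself. The `Node` class and both functions are hypothetical names for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    base: int = 1
    skew: int = 0
    alive: bool = True

    @property
    def interval(self) -> float:
        # base + skew/256 seconds, as described earlier in this section
        return self.base + self.skew / 256


def vip_holder(nodes):
    """Return the live node with the smallest interval, or None if all are down."""
    live = [n for n in nodes if n.alive]
    return min(live, key=lambda n: n.interval, default=None)


def run_scenario():
    """Walk through normal operation, failure of Firewall 1, and its recovery."""
    fw1 = Node("Firewall 1", skew=0)
    fw2 = Node("Firewall 2", skew=100)
    pair = [fw1, fw2]
    timeline = [vip_holder(pair).name]  # normal operation
    fw1.alive = False                   # Firewall 1 stops advertising
    timeline.append(vip_holder(pair).name)
    fw1.alive = True                    # Firewall 1 returns and preempts
    timeline.append(vip_holder(pair).name)
    return timeline


print(run_scenario())  # ['Firewall 1', 'Firewall 2', 'Firewall 1']
```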

pfsync - Keeping Track of the State of Things

pfsync is the state synchronization component of HA on pfSense software. It synchronizes the state tables of two devices to ensure that the packet filter (pf) component of the firewall operating system is ready to take over existing connections if the other node fails.

Without pfsync, using our previous example for CARP, if Firewall 1 were to fail over to Firewall 2 without states being synchronized, any existing stateful connections would be broken and need to be re-established. This significantly impacts connection-oriented protocols like TCP under stateful firewall rules, but may not impact connectionless protocols such as UDP or ICMP as severely. Either way, it can certainly cause headaches for end users and applications.

Without stateful synchronization across the two nodes, every download, file transfer, backup, email client, and more would have at least a momentary “blip” in connectivity and may even require closing and reopening connections/programs. Since the goal of HA is to ensure network continuity when a problem happens, this is obviously not ideal.
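To make the difference concrete, here is a toy Python model of two firewalls, with and without state synchronization. It is a sketch under heavy simplification: a "state" is just a tuple, and copying it to the peer stands in for a pfsync update. The `ToyFirewall` class is hypothetical and does not reflect how pf or pfsync are implemented.

```python
from typing import NamedTuple, Optional


class State(NamedTuple):
    proto: str
    src: str
    dst: str


class ToyFirewall:
    def __init__(self) -> None:
        self.states = set()

    def open_connection(self, state: State,
                        sync_peer: Optional["ToyFirewall"] = None) -> None:
        """Record a new state locally and, if a peer is given, on the peer too."""
        self.states.add(state)
        if sync_peer is not None:  # stands in for a pfsync state update
            sync_peer.states.add(state)

    def passes_midstream_packet(self, state: State) -> bool:
        """A packet of an existing connection passes only if its state is known."""
        return state in self.states


download = State("tcp", "192.168.1.50:51000", "203.0.113.10:443")

# With synchronization: the secondary already knows the connection.
fw1, fw2 = ToyFirewall(), ToyFirewall()
fw1.open_connection(download, sync_peer=fw2)
print(fw2.passes_midstream_packet(download))  # True

# Without synchronization: the connection must be re-established.
fw1, fw2 = ToyFirewall(), ToyFirewall()
fw1.open_connection(download)
print(fw2.passes_midstream_packet(download))  # False
```

The client and server addresses in `download` are invented for the example.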

As there is no authentication involved with pfsync, it’s important that this be run over a dedicated SYNC interface to ensure no manipulation of the state table by a bad actor on the same segment.

XMLRPC - Because Nobody Wants to do Things Twice

The XML remote procedure call (XMLRPC) component of HA is responsible for keeping the configuration synchronized between the primary firewall and the secondary firewall. This ensures that firewall rules, CARP VIPs, NAT configurations, and more are kept the same across both firewalls. While not strictly necessary, without it an engineer would need to remember to make every single change on both firewalls, which creates an opportunity for human error that automating the process avoids.

The XMLRPC synchronization process runs every time a change is made on the primary firewall. It uses a dedicated username and password configured for just this purpose and typically will run over the same dedicated SYNC interface that pfsync utilizes.

Outbound NAT and DHCP Scopes - Tying it All Together

The last part of configuring High Availability is making Outbound NAT and DHCP for clients work with your new HA setup, though these are not strictly HA components of pfSense software.

For Outbound NAT, by default pfSense software uses the WAN interface address when translating IPv4 traffic from private network ranges behind the firewall to a public, routable IP address on the internet. This is a problem for HA if left unchanged: the states synchronized by pfsync would reference the WAN IP address of the primary firewall, which the secondary firewall does not hold after a failover, thus breaking those states. To remedy this, configure Hybrid or Manual Outbound NAT and then create rules to NAT IPv4 traffic to the CARP VIP address on all relevant WAN connections. This will ensure that states are valid on both firewalls in a failover event.
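The reasoning can be illustrated with a small Python sketch that reuses the example WAN addresses from earlier in this article. It treats a NAT state as usable after failover only if both nodes would translate to the same source address, which is a simplification of what actually happens. The function names are hypothetical.

```python
# Example WAN addresses from the CARP section of this article.
WAN_ADDRESS = {
    "Firewall 1": "198.51.100.201",
    "Firewall 2": "198.51.100.202",
}
WAN_CARP_VIP = "198.51.100.200"


def translated_source(node: str, nat_to_carp_vip: bool) -> str:
    """Source address outbound traffic is translated to on the given node."""
    return WAN_CARP_VIP if nat_to_carp_vip else WAN_ADDRESS[node]


def state_usable_after_failover(nat_to_carp_vip: bool) -> bool:
    """True if both nodes translate to the same address (simplified criterion)."""
    before = translated_source("Firewall 1", nat_to_carp_vip)
    after = translated_source("Firewall 2", nat_to_carp_vip)
    return before == after


print(state_usable_after_failover(nat_to_carp_vip=False))  # False
print(state_usable_after_failover(nat_to_carp_vip=True))   # True
```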

DHCP scopes, whether on pfSense software or your own DHCP server, need to be updated to use the CARP VIP for the client gateway, rather than the actual interface IP address of the primary firewall. Otherwise, if the primary firewall goes offline, clients will not be able to reach their upstream gateway and they will lose all connectivity.

Finally, if there are any clients using static IP addresses, they will also need to have their gateway address updated to the CARP VIP, if using pfSense software for their gateway.

Conclusion

As we have shown, all of these critical components of HA in pfSense software work together to provide an excellent and robust solution that will keep your end users happy even in the event of hardware failure that would otherwise cripple your network.

We hope you found this information helpful. If you would like any assistance in implementing an HA solution for your organization, Netgate offers a Professional Services team that is able to provide start-to-finish HA implementation and support. For more information, please contact Netgate today.

See our documentation to learn more about High Availability on pfSense software. Netgate provides additional training content videos and documents at no cost. See Section 9 - High Availability for a helpful video on the Curriculum tab of our Training and Certification page.
