Troubleshooting Server Load Balancing

This section describes how to identify, troubleshoot, and resolve the most common issues users encounter with server load balancing.

Connections not being balanced

Connections not being balanced is almost always a failure of the testing methodology rather than of the load balancer, and the problem is usually specific to HTTP. Web browsers commonly keep connections to a web server open, and hitting refresh re-uses the existing connection; a single connection will never be moved to another balanced server. Browser caching is another common issue, in which case the browser never actually requests the page again. It is preferable to use a command line tool such as curl for this kind of testing, because it avoids both problems: curl has no cache, and it opens a new connection to the server each time it is run. More information on curl can be found in Verifying load balancing.
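For example, a short shell loop with curl can exercise the balancer repeatedly. This is only a sketch: 203.0.113.10 is a placeholder documentation address, and the grep assumes the pool servers return a Server header that distinguishes them; substitute the real virtual server IP and an identifying header or page element.

```shell
# Each curl run opens a brand-new TCP connection and performs no caching,
# so repeated requests are genuinely balanced rather than re-using one
# kept-alive browser connection. 203.0.113.10 is a placeholder VIP.
VIP="203.0.113.10"
for i in 1 2 3 4; do
  # -s silences progress output, -I fetches headers only.
  curl -sI --connect-timeout 2 "http://$VIP/" | grep -i '^Server:' \
    || echo "request $i: no response or no Server header"
done
```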

If sticky connections are enabled, ensure testing is performed from multiple source IP addresses. Tests from a single source IP address will go to a single server unless a long period of time elapses between connection attempts.
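If the testing host has multiple addresses configured, curl can bind each request to a different source with its --interface option, which also accepts an IP address. A hedged sketch, with 10.6.0.21 and 10.6.0.22 as hypothetical local address aliases and 203.0.113.10 as a placeholder virtual server IP:

```shell
# With sticky connections, each distinct source address should keep
# landing on the same back-end server, while different sources may be
# sent to different servers. All addresses below are placeholders.
curl -s --connect-timeout 2 --interface 10.6.0.21 http://203.0.113.10/ \
  || echo "request from 10.6.0.21 failed"
curl -s --connect-timeout 2 --interface 10.6.0.22 http://203.0.113.10/ \
  || echo "request from 10.6.0.22 failed"
```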

Down server not marked as offline

If a server goes down but is not marked as offline, it is because the monitoring performed by the load balancing daemon believes it is still up and running. If using a TCP monitor, the TCP port must still be accepting connections. The service on that port could be broken in numerous ways and still answer TCP connections. For ICMP monitors, this problem is exacerbated, as servers can be hung or crashed with no listening services at all and still answer to pings.

Live server not marked as online

If a server is online but not marked as online, then it is not online from the perspective of the load balancing daemon's monitors. The server must answer on the TCP port used by the monitor, or respond to pings sourced from the IP address of the firewall interface closest to the server.

For example, if the server is on the LAN, the server must answer requests initiated from the LAN IP address of the firewall. To verify this for ICMP monitors, browse to Diagnostics > Ping and ping the server IP address using the interface where the server is located.
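The same ICMP check can also be run from a shell on the firewall. On FreeBSD, which pfSense is based on, ping's -S flag sets the source address; the addresses below are hypothetical, with 10.6.0.1 standing in for the firewall's LAN address and 10.6.0.12 for the pool server:

```shell
# Ping the pool server sourced from the firewall's LAN address,
# mirroring what the ICMP monitor does. Both addresses are placeholders.
ping -S 10.6.0.1 -c 3 10.6.0.12 \
  || echo "server did not answer pings sourced from the LAN address"
```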

For TCP monitors, use Diagnostics > Test Port, and choose the firewall’s LAN interface as the source, and the web server IP address and port as the target.

Another way to test is from a shell prompt on the firewall, reached either from the console or via ssh menu option 8, using the nc command. Here is an example of a failed connection:

# nc -vz 10.6.0.12 80
nc: connect to 10.6.0.12 port 80 (tcp) failed: Operation timed out

And here is an example of a successful connection:

# nc -vz 10.6.0.12 80
Connection to 10.6.0.12 80 port [tcp/http] succeeded!

If the connection fails, troubleshoot further on the web server.

Unable to reach a virtual server from a client in the same subnet as the pool server

Client systems in the same subnet as the pool servers will fail to connect properly with this load balancing method. relayd forwards the connection to the web server with the client's source address intact, so the server tries to respond directly to the client. If the server has a direct path to the client, e.g. through a locally connected NIC in the same subnet, the reply does not flow back through the firewall: the client receives it from the server's local IP address rather than the virtual server IP address in relayd. Because the reply comes from an address the client did not connect to, the connection is dropped as invalid.

One way around this is to craft a manual outbound NAT rule so that traffic leaving the internal interface (LAN), sourced from the LAN subnet and destined for the web servers, is translated to the LAN interface address. The traffic then appears to originate from the firewall, so the server responds back to the firewall, which relays the traffic to the client using the expected addresses. The original client source IP address is lost in the process, but the only other viable solution is to move the servers to a different network segment.
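For illustration, the underlying pf rule generated for such a manual outbound NAT entry might look roughly like the fragment below. This is only a hypothetical sketch: em1 stands for the LAN interface, 10.6.0.0/24 for the LAN subnet, 10.6.0.12 for a pool server, and 10.6.0.1 for the firewall's LAN address.

```text
# Translate LAN-client traffic destined for the pool server to the
# firewall's LAN address so that replies return through the firewall.
# All interface names and addresses are placeholders.
nat on em1 from 10.6.0.0/24 to 10.6.0.12 -> 10.6.0.1
```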