DB Connection pool losing connections (AKA: the not so transparent firewall)

I'm posting this because I've seen this problem many times, especially in corporate environments.

The symptom

You have an application (usually a Java app running in an app container such as Tomcat) that connects to a database (any database). You change from a simple "connect to the db when you need it" approach to "use a connection pool to eliminate connection setup time".

You immediately roll back the connection pool change because the connections seem to die after some time (usually during periods of low usage or at night), making your application unresponsive.

The explanation

Well, there are many possible causes for the symptom, but one of the most common I've seen (especially in corporate environments) is the widespread use of transparent firewalls.

As a side note: be wary of anything that calls itself a transparent something, such as transparent firewalls or transparent proxies, because they are, at best, translucent.

To understand the problem, we must first understand that the TCP stack closes a connection by exchanging a number of packets with its peer TCP layer. The message flow is complex because a TCP connection can be half closed, meaning one of the peers is done sending but the other is still sending. To close a connection, one of the peers sends a segment with the FIN flag set (which triggers the FIN-WAIT states you can see with the netstat command). On the side that closes first, the state goes from FIN-WAIT-1 to FIN-WAIT-2, then to TIME-WAIT and finally to CLOSED (CLOSING only shows up when both peers send their FIN at the same time).

Another alternative to connection closing is for one peer to send a TCP RST (reset) packet, indicating that the whole connection is in failure mode and shall be terminated.

TCP keep-alive was invented so that long-lived connections could be kept alive. It works by exchanging ACK (acknowledge) packets between peers every now and then while the connection is otherwise idle (the interval is usually configurable at the kernel level).
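To make this concrete, here is a minimal Java sketch of how an application asks the kernel for keep-alive on a socket. The class and method names (KeepAliveExample, newKeepAliveSocket) are made up for illustration; Socket.setKeepAlive is the standard java.net call. Note that the application only flips the switch: when and how often the probes go out is decided by the kernel (on Linux the default idle time before the first probe is two hours, tunable via net.ipv4.tcp_keepalive_time).

```java
import java.io.IOException;
import java.net.Socket;

// Illustrative only: class and method names are hypothetical.
final class KeepAliveExample {
    /** Returns an unconnected socket with SO_KEEPALIVE turned on. */
    static Socket newKeepAliveSocket() throws IOException {
        Socket socket = new Socket();   // not connected to anything yet
        socket.setKeepAlive(true);      // ask the kernel to probe the peer when idle
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = newKeepAliveSocket()) {
            System.out.println("SO_KEEPALIVE enabled: " + socket.getKeepAlive());
        }
    }
}
```

Most JDBC drivers open the socket for you, so in practice this switch is usually exposed as a driver or connection URL option rather than set by hand; check your driver's documentation for the exact name.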

When keeping a connection alive, if a host doesn't hear from its peer, a lengthy series of attempts to reach the peer begins (we'll come back to this later). If they all fail, the kernel gives up, drops the connection and reports an error to the application the next time it touches the socket.

Enter the corporate firewall.

The most common usage of a corporate firewall is to actually do connection filtering. This means: telling which host can connect to which other host+port. I'm not really sure why a firewall is used for this, because most basic routers (and even switches) can do this kind of thing, albeit more in a NAT kind of way, not in an IP-hiding way.

Connection filtering works something like this:

  • When a connection is attempted, the firewall looks at the filtering rules to check if the 4-tuple (source IP, source port, destination IP, destination port) is in the list of allowed connections.
  • If it's not on the list, then one of the following happens:
    • an RST packet is sent to the source host, blocking the establishment of the connection. Not the best option, as it informs the source host that the remote IP does exist but doesn't like the port.
    • nothing is sent, letting the source host retry until it times out. This is good, as the source host does not learn that the remote IP is valid, which hinders address/port network scans.
  • If the 4-tuple is on the list, then both of the following happen:
    • An entry representing the 4-tuple is created in an in-memory table inside the firewall software.
    • The connection setup packet is forwarded to the destination host (oftentimes with some IP address changing).
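The filtering steps above can be sketched as a toy model in Java. This is not how any particular firewall is implemented; every name here (ConnectionFilter, FourTuple, Rule, Verdict) is hypothetical, and I'm assuming the rules match on source host plus destination host+port and ignore the (ephemeral) source port. Only the "send nothing" variant of the rejection is modelled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of connection filtering; all names are hypothetical.
final class ConnectionFilter {
    /** The 4-tuple that identifies one TCP connection. */
    record FourTuple(String srcIp, int srcPort, String dstIp, int dstPort) {}

    /** An allow rule: this source host may reach this destination host+port. */
    record Rule(String srcIp, String dstIp, int dstPort) {}

    enum Verdict { FORWARD, DROP_SILENTLY }

    private final Set<Rule> allowed = new HashSet<>();
    // In-memory table of live connections: 4-tuple -> time of last traffic.
    private final Map<FourTuple, Long> connectionTable = new HashMap<>();

    void allow(String srcIp, String dstIp, int dstPort) {
        allowed.add(new Rule(srcIp, dstIp, dstPort));
    }

    /** Called when a connection setup packet arrives at the firewall. */
    Verdict onConnectionAttempt(FourTuple tuple, long nowMillis) {
        Rule needed = new Rule(tuple.srcIp(), tuple.dstIp(), tuple.dstPort());
        if (!allowed.contains(needed)) {
            return Verdict.DROP_SILENTLY;      // nothing is sent back to the source
        }
        connectionTable.put(tuple, nowMillis); // remember the connection...
        return Verdict.FORWARD;                // ...and let the packet through
    }

    boolean isTracked(FourTuple tuple) {
        return connectionTable.containsKey(tuple);
    }
}
```

The important detail for the rest of this post is the connectionTable: once a connection is set up, whether its packets get through depends on that entry still being there.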

So far so good.

Now, for a reason that escapes me, many (most?) corporate firewalls don't like low-traffic, long-lived TCP connections. They dislike these connections enough to go to the extreme of detecting TCP keep-alives in order to be able to euthanize them.

Low traffic long lived connections in corporate environments are darwinianly selected out by corporate firewalls. This natural selection is what's causing trouble in your environment.

The real problem is that the connections are not being terminated, just being euthanized. Yes, there is a big difference.

In a connection termination scenario, the firewall has to do two things:

  1. Send each real peer a TCP RST packet. This gives the peers a notification of the massacre. This notification enables each host's kernel to return an error to the application, which will notice the connection is closed and be able to reconnect.
  2. Drop the 4-tuple from the firewall's valid connections table (thus blocking all following traffic).

But the connection is euthanized. This means the firewall usually does just step number 2.

The peers, not knowing about the transparent firewall, enter the lengthy process of trying to reach each other. A process that is doomed to fail because the firewall is not forwarding the traffic.

Eventually, the (kernel) peers decide to give up and close the connection, letting the application know with an error. The key word in the previous sentence being eventually. In the meantime, the connections are zombies inside the kernel and the application is not notified until it's too late.
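The difference between termination and euthanasia can be shown with another toy model, again with made-up names (IdleEvictingFirewall and its methods) and an idle timeout that I'm simply assuming as a constructor parameter. When a connection has been idle for too long, the model does only step 2: it forgets the table entry and drops the packet, and no RST is ever produced.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a firewall that silently forgets idle connections.
// All names are hypothetical; connections are keyed by an opaque string.
final class IdleEvictingFirewall {
    private final long idleTimeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    IdleEvictingFirewall(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    void onConnectionEstablished(String connection, long nowMillis) {
        lastSeen.put(connection, nowMillis);
    }

    /** Returns true if the packet is forwarded, false if it is silently dropped. */
    boolean onPacket(String connection, long nowMillis) {
        Long seen = lastSeen.get(connection);
        if (seen == null) {
            return false;                     // unknown connection: drop, no RST
        }
        if (nowMillis - seen > idleTimeoutMillis) {
            lastSeen.remove(connection);      // step 2 only: forget the entry
            return false;                     // step 1 (the RST) never happens
        }
        lastSeen.put(connection, nowMillis);  // real traffic resets the idle clock
        return true;
    }
}
```

From the application's point of view, a false here looks like a packet lost on the wire, so the kernel keeps retransmitting instead of reporting an error. A connection that sees traffic more often than the idle timeout never reaches the eviction branch.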

The Solution

The good news is that you can probably fix this problem without having to make the people running the firewalls understand they are causing trouble and change their defaults (after all, this is a silo'd corporate environment, what are your chances of effecting change in a different department?).

If you are using a standard connection pool, there are two options you have to set to avoid triggering the euthanasia:

  • Set the connection idle timeout (or keep-alive time) to a lower value than the firewall's idle timeout. If the guys running the firewall don't tell you what this value is, you can always test it by hand.
  • Set the option to do an active keep-alive. This will effectively send a command to the database, causing real traffic to go through the connection. This involves a command without side effects (such as select 1 from dual, select current_time or something like that).
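As a rough sketch of those two options, the snippet below builds a set of pool properties from a firewall idle timeout. The property names (validationQuery, testWhileIdle, timeBetweenEvictionRunsMillis, minEvictableIdleTimeMillis) are the ones documented for Apache Commons DBCP and the Tomcat JDBC pool; other pools use different names, so treat them as an assumption about your setup. The PoolSettings class, the firewall timeout you pass in, and the "half" and "quarter" ratios are purely illustrative choices of mine.

```java
import java.time.Duration;
import java.util.Properties;

// Illustrative helper; the class name and the 1/2 and 1/4 ratios are assumptions.
final class PoolSettings {
    static Properties forFirewallIdleTimeout(Duration firewallIdleTimeout,
                                             String validationQuery) {
        long firewallMillis = firewallIdleTimeout.toMillis();
        Properties p = new Properties();
        // Option 2: active keep-alive with a side-effect-free query on idle connections.
        p.setProperty("validationQuery", validationQuery);
        p.setProperty("testWhileIdle", "true");
        p.setProperty("timeBetweenEvictionRunsMillis", Long.toString(firewallMillis / 4));
        // Option 1: retire idle connections well before the firewall forgets them.
        p.setProperty("minEvictableIdleTimeMillis", Long.toString(firewallMillis / 2));
        return p;
    }

    public static void main(String[] args) {
        // Assumed firewall idle timeout of 60 minutes; measure or ask for the real one.
        forFirewallIdleTimeout(Duration.ofMinutes(60), "SELECT 1")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The validation query has to be valid for your database (select 1 from dual for Oracle, for instance), and the exact behaviour of each property varies between pools and versions, so check the pool's documentation before copying the values.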

If you are not using a connection pool but an ad-hoc application-level protocol, you can always create a nop (no-operation) command that responds with some data.
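A minimal sketch of such a nop, assuming a made-up line-based protocol where the client sends "NOP" and the server answers "OK" (both the protocol and the NopHeartbeat class are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical line-based protocol: client sends "NOP", server answers "OK".
final class NopHeartbeat {
    private final OutputStream out;
    private final BufferedReader in;

    NopHeartbeat(OutputStream out, InputStream in) {
        this.out = out;
        this.in = new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII));
    }

    /** Sends one NOP and returns true only if the peer answered "OK". */
    boolean ping() {
        try {
            out.write("NOP\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            return "OK".equals(in.readLine());  // null (stream closed) counts as failure
        } catch (IOException e) {
            return false;                       // any I/O error means: reconnect
        }
    }
}
```

In a real application you would build this from a socket's getOutputStream() and getInputStream(), call Socket.setSoTimeout first so a read on a euthanized connection fails instead of blocking, and run ping() from a ScheduledExecutorService at an interval shorter than the firewall's idle timeout.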

This is enough to be my eleventh post.
