DB connection pool losing connections (AKA: the not-so-transparent firewall)

I'm posting this because I've seen this problem many times, especially in corporate environments.

The symptom

You have an application (usually a Java app running in an app container such as Tomcat) that connects to a (any) database. You change from a simple "connect to the db when you need it" approach to "use a connection pool to eliminate connection setup time".

You immediately roll back the connection pool change because the connections seem to die after some time (usually during low usage or at night), making your application unresponsive.

The explanation

Well, there are many possible causes for the symptom, but one of the most common causes I've seen (especially in corporate environments) is the widespread use of transparent firewalls.

As a side note: be wary of anything that calls itself a transparent something, such as transparent firewalls or transparent proxies, because they are, at best, translucent.

To understand the problem, we must first understand that the TCP stack closes a connection by exchanging a number of packets with its peer TCP layer. The message flow is complex because a TCP connection can be half closed, meaning one of the peers is done with it but the other is still sending. To close a connection, one of the peers sends a segment with the FIN flag set (which triggers the FIN-WAIT-1 status you can see with the netstat command). FIN-WAIT-1 then progresses to FIN-WAIT-2 once the FIN is acknowledged, and from there to TIME-WAIT and finally CLOSED (the CLOSING status only shows up when both peers happen to close at the same time).

The alternative way to close a connection is for one peer to send a TCP RST (reset) packet, indicating that the whole connection is in failure mode and shall be terminated.

TCP keep-alive was invented so that long-lived, mostly idle connections could be kept open: the peers exchange probe and ACK (acknowledge) packets every now and then (the interval is usually configurable at the kernel level).

When keeping a connection alive, if a host doesn't hear from its peer, a lengthy series of attempts to reach the peer begins (we'll come back to this later). Upon failure, this series ends with the kernel aborting the connection and reporting a timeout error to the application, after the connection has occupied a connection table slot for quite some time.
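For illustration, here is a minimal Java sketch of how an application opts in to TCP keep-alive on a plain socket. The host and port are placeholders, and the probe timings themselves come from kernel settings (such as tcp_keepalive_time on Linux), not from the socket call:

```java
import java.io.IOException;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws IOException {
        // Placeholder host/port, for illustration only.
        try (Socket socket = new Socket("db.example.com", 5432)) {
            // Ask the kernel to send keep-alive probes on this connection.
            // How often, and how many probes before giving up, is configured
            // at the kernel level, not per socket here.
            socket.setKeepAlive(true);
            System.out.println("keep-alive enabled: " + socket.getKeepAlive());
        }
    }
}
```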

Enter the corporate firewall.

The most common usage of a corporate firewall is actually connection filtering. This means: deciding which host can connect to which other host+port. I'm not really sure why a firewall is used for this, because most basic routers (and even switches) can do this kind of thing, albeit more in a NAT kind of way than in an IP-hiding way.

Connection filtering works something like this (a toy sketch of the bookkeeping follows the list):

  • When a connection is attempted, the firewall looks at the filtering rules to check whether the 4-tuple (source IP, source port, destination IP, destination port) is in the list of allowed connections.
  • If it's not on the list, one of the following happens:
    • An RST packet is sent to the source host, blocking the establishment of the connection. Not the best option, as it informs the source host that the remote IP does exist but doesn't like the port.
    • Nothing is sent, letting the source host retry until it times out. This is good, as the source host doesn't learn that the remote IP is valid, preventing address/port network scans.
  • If the 4-tuple is on the list, both of the following happen:
    • An entry representing the 4-tuple is created in an in-memory table inside the firewall software.
    • The connection setup packet is forwarded to the destination host (oftentimes with some IP address rewriting).
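To make that bookkeeping concrete, here is a toy Java sketch (not any real firewall's code; all names are invented) of a connection table keyed by the 4-tuple:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a stateful filter's connection table (Java 16+ for records).
final class ConnectionTable {
    // The 4-tuple that identifies a TCP connection.
    record FourTuple(String srcIp, int srcPort, String dstIp, int dstPort) {}

    private final Set<FourTuple> established = ConcurrentHashMap.newKeySet();

    // A permitted connection setup was seen: remember the 4-tuple.
    void track(FourTuple t) { established.add(t); }

    // For every later packet: forward only if the connection is known.
    boolean shouldForward(FourTuple t) { return established.contains(t); }

    // The "euthanasia" described below: an idle timer fires and the entry
    // is silently dropped, with no RST sent to either peer.
    void expireIdle(FourTuple t) { established.remove(t); }
}
```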

So far so good.

Now, for a reason that escapes me, many (most?) corporate firewalls don't like low-traffic, long-lived TCP connections. They dislike these connections badly enough to go to the extreme of detecting TCP keep-alives in order to be able to euthanize them.

Low-traffic, long-lived connections in corporate environments are Darwinianly selected out by corporate firewalls. This natural selection is what's causing trouble in your environment.

The real problem is that the connections are not being terminated, just being euthanized. Yes, there is a big difference.

In a connection termination scenario, the firewall has to do two things:

  1. Send each real peer a TCP RST packet. This gives the peers a notification of the massacre, which enables each host's kernel to return an error to the application, which in turn notices the connection is closed and can reconnect.
  2. Drop the 4-tuple from the firewall's valid-connections table (thus blocking all subsequent traffic).

But the connection is euthanized. This means the firewall usually does just step number 2.

The peers, not knowing about the transparent firewall, enter the lengthy process of trying to reach each other. A process that is doomed to fail because the firewall is no longer forwarding the traffic.

Eventually, the (kernel) peers decide to give up and close the connection, letting the application know with an error. The key word in the previous sentence being eventually. In the meantime, the connections are zombies inside the kernel and the application is not notified until it's too late.

The Solution

The good news is that you can probably fix this problem without trying to make the people running the firewalls understand that they are causing trouble and change their defaults (after all, this is a siloed corporate environment; what are your chances of effecting change in a different department?).

If you are using a standard connection pool, there are two options you have to set to avoid triggering the euthanasia (both appear in the configuration sketch after this list):

  • Set the connection timeout (or keep-alive interval) to a value lower than the firewall's idle timeout. If the people running the firewall won't tell you what this value is, you can always test it by hand.
  • Set the option to do an active keep-alive. This will periodically send a command to the database, causing real traffic to go through the connection. Use a command without side effects (such as select 1 from dual, select current_time or something like that).
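For concreteness, here is a minimal sketch assuming the Tomcat JDBC connection pool: the property names are that pool's, but the URL, credentials, query and timings below are placeholders, and the timings must be tuned to sit below your firewall's actual idle timeout.

```java
import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public class PoolSetup {
    static DataSource firewallFriendlyPool() {
        PoolProperties p = new PoolProperties();
        // Placeholder connection details.
        p.setUrl("jdbc:oracle:thin:@db.example.com:1521:ORCL");
        p.setDriverClassName("oracle.jdbc.OracleDriver");
        p.setUsername("app");
        p.setPassword("secret");

        // Option 1: evict connections that have been idle longer than the
        // (assumed) firewall idle timeout, before the firewall kills them.
        p.setTimeBetweenEvictionRunsMillis(60_000);
        p.setMinEvictableIdleTimeMillis(300_000);

        // Option 2: actively generate real traffic on idle connections by
        // running a side-effect-free validation query on each sweep.
        p.setTestWhileIdle(true);
        p.setValidationQuery("select 1 from dual");

        DataSource ds = new DataSource();
        ds.setPoolProperties(p);
        return ds;
    }
}
```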

If you are not using a connection pool but an ad-hoc application-level protocol, you can always add a nop (no-operation command) that responds with some data, and send it periodically.
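If the protocol is your own, the periodic nop can be a scheduled task. This is a toy Java sketch of the client side; the NOP command, its reply and the one-minute period are invented here for illustration, and it assumes nothing else reads from the socket while the connection is idle:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Every minute, send a NOP and wait for the answer. The round trip is real
// traffic, so the firewall keeps the connection's table entry alive.
public class NopPinger {
    public static void start(Socket socket) throws IOException {
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        BufferedReader in = new BufferedReader(new InputStreamReader(
                socket.getInputStream(), StandardCharsets.US_ASCII));
        ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            try {
                out.println("NOP");            // invented command name
                String reply = in.readLine();  // server answers, e.g. "OK"
                if (reply == null) throw new IOException("connection closed");
            } catch (IOException e) {
                // Connection is dead: stop pinging, let the app reconnect.
                timer.shutdown();
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}
```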

This is enough to be my eleventh post.


Will extreme poverty ever end? Nuru International


After watching a Google Tech Talk by Jake Harriman, founder and CEO of Nuru International (nuruinternational.org), titled Will Extreme Poverty Ever End?, three things got into my mind.

The very first one was to actually go and make a monthly donation. Go ahead and do so yourself. Not only will you never die of a severe case of donation, it also increases your overall happiness, which reduces stress. So go and do something good for others and yourself (or yourself and others).

The second thing I couldn't stop doing was wondering about this guy. A US Marine who went to Afghanistan (something that some people, including me, consider to be one of those bad things the US does abroad). This guy decided to keep fighting terrorism on very different front lines: those of NGOs and humanitarian relief.

I can't help thinking that when a soldier decides that she or he will be more effective fighting terrorism by eradicating poverty, a lot of good things are going on in the world. Many, many more good things than we would assume if we just listened to mainstream news organizations, or to old folks saying things used to be better.

While I never had the paralysis problem when facing a decision to let go of things and start anew, it is very telling that changes like the one Jake Harriman made are not only possible, but feasible.

Bold move.

The third thing that got into my mind is innovation. You can tell that this guy took a lot of baggage from the Marines (well, you can't stop being who you are) and applied it to the field of humanitarian relief. Nuru works like a contractor. They don't try to reinvent the wheel, but to figure out the most effective wheel for each type of task. Nuru partners with other NGOs so that both NGOs can better fulfill their respective missions. And Nuru tries to actually measure things to figure out what works and what doesn't, to try new things and fail fast.

A very agile style.

I couldn't help comparing this approach with that of the record labels and the entertainment industry. We were recently bombarded with the whole SOPA/PIPA debate (even here in Argentina it was all over the news). These industries don't try new things. They don't even like others trying new things. In the case of Hollywood, they even disregard their own history: independent studios relocated to the US west coast because the many patents and laws on the east coast were stifling innovation and new business models. And those independent studios grew to be Hollywood.

So, instead of collaborating with Napster, the labels shut it down. Neither got the chance to better fulfill their respective missions.

Nothing like contrast.

This is enough to be my tenth post.