2012-12-24

Another year is ending. Did you make a donation?


Another year. Time goes by faster and faster.

I usually wrap up the year by making some donations.

This time I donated to:

  • Wikimedia Foundation. The donation page is here (English) or here (Spanish).
  • The SETI Institute. The donation page is here.

Also, all this year I've been making monthly donations to the World Food Program (English or Spanish) and Nuru International (here).

So, end this year by doing something for others. That is money well spent, and it actually makes you feel better about yourself.

This is enough to be my fourteenth post.

How much memory does a JVM need?


Update: fixed some typos pointed out by Ezequiel and Lecko.

At this point in IT affairs, we should update the old US saying to something like "Nothing is certain but death, taxes and a JVM Full GC".

Because, no matter what, one thing you can count on is that a JVM will eventually perform a Full Stop the World Garbage Collection. This is true with Concurrent Mark and Sweep (-XX:+UseParNewGC -XX:+UseConcMarkSweepGC), and with the relatively new G1 garbage collector in Java 6 (-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC).

Hence, the right thing to do is to give the JVM as much memory as you can afford. Right? Well, not necessarily.

Some ground leveling


What follows is in line with the old CMS (Concurrent Mark Sweep) GC because of the memory structure I will discuss. G1, as far as I know, uses a somewhat different structure that is more resilient, but in the end it will also perform a Full Stop the World Garbage Collection, just less frequently.

The standard memory arrangement for a JVM is to divide the memory into three areas. You have the New memory, the Tenured memory (OldGen) and the Perm (permanent) memory. The Permanent memory is used to hold bytecode, class definitions and things like that.

The New memory is itself divided in two: the Eden and the Survivor areas. Eden is where short lived objects are created (that is: when you create a string inside a method, an object is created in Eden and it becomes garbage when the method scope is destroyed). This Eden arrangement is what makes for a lightweight GC. How does it work? Well, if you think about it, when you have a server thread, you wait for requests and when you get one, you usually call a method to handle it. This method creates everything in Eden and, when it terminates, almost every object residing in Eden has been destroyed (because of method scoping). So you can mostly reclaim all of Eden's memory without having to think about handling memory fragments. Empirically, most objects in a JVM have a very short lifespan (or a high mortality rate, if you want) and Eden is arranged for that.
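
To make this concrete, here is a minimal sketch of the allocation pattern Eden is optimized for (the class and method are made up for illustration): everything allocated inside the method dies young, except the one object that escapes by being returned.
    // Hypothetical request handler showing the typical Eden allocation pattern.
    public class RequestHandler {

        // Everything allocated here becomes garbage as soon as the method
        // returns, except the String that escapes to the caller.
        public String handle(String request) {
            StringBuilder buffer = new StringBuilder();   // short lived, dies in Eden
            String[] parts = request.split(" ");          // short lived, dies in Eden
            for (String part : parts) {
                buffer.append(part.trim());               // temporaries die in Eden
            }
            return buffer.toString();   // escapes: may be copied to a Survivor area
        }
    }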

Yes, I said almost every object in Eden has been destroyed. That is what the Survivor area is for: objects that survive method scoping (e.g. the method's return value) get sent (copied) to Survivor. When Eden is full, a Minor GC occurs. This Minor GC does a few things:

  • reclaim all of Eden's dead objects,
  • move all of Eden's live objects to the Survivor area (by copying),
  • stage the other Survivor objects (using generations, akin to being X years old), and
  • send the oldest generation of Survivor objects to the Old Generation (the Tenured memory) by copying.

Note: there are actually two Survivor areas, which are used alternately to make things more efficient.
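
If you want to play with this staging, HotSpot exposes it through flags; a couple of illustrative (not prescriptive) values:
    -XX:SurvivorRatio=8          (Eden is 8 times the size of one Survivor area)
    -XX:MaxTenuringThreshold=15  (objects surviving 15 Minor GCs get promoted to OldGen)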

If you carefully think about what I've just said, you will realize that an object can be dead to the program (meaning: no longer accessible) but still be using memory and pretty much alive for all practical purposes. Objects are not really destroyed until they are Garbage Collected. This is what the Mark and Sweep does: decide which objects are reachable (alive) and which objects are not (to destroy them). And yes, this is why you should really, really think twice before using finalizers, because the finalizer is called when garbage is being collected, not when the object goes out of scope. This keeps the resources in use and slows down the GC process, because it cannot just kill the object: it has to give the finalizer a chance to run, so the object survives the current GC cycle and its memory stays occupied.
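
A minimal sketch of the finalizer problem (the class is hypothetical): the object below cannot be reclaimed in the same GC cycle that finds it unreachable, because the collector must first queue it so finalize() gets a chance to run.
    // Hypothetical example of why finalizers delay memory reclamation.
    public class LeakyHandle {
        private final byte[] payload = new byte[1024 * 1024];  // 1Mb held longer than needed

        @Override
        protected void finalize() throws Throwable {
            try {
                // Runs when the GC discovers the object is unreachable, NOT when
                // it goes out of scope. Until this runs, the object (and its
                // payload) survives at least one extra GC cycle in the
                // finalization queue.
                System.out.println("releasing handle");
            } finally {
                super.finalize();
            }
        }
    }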

See the following chart to fully grasp what happens on a Minor GC:


The question is: what happens when Old Gen is full? Does the program halt? Enter the Full Stop the World Garbage Collection.

When this happens, the GC stops all processing in the JVM and starts collecting memory from zombie objects (alive but unreachable). This encompasses objects in the Old Gen, as well as objects in Eden, Survivor and things in Perm memory. The Perm is included because some loaded or dynamically created classes might no longer be needed, and that memory could be reclaimed.

As for what goes on with the rest of the objects, well, they are marked as dead or alive, and the live objects are copied to new memory positions in order to create a large block of unused memory. This copying is expensive on its own, but what makes things even worse is that every pointer (reference) to a moved object must also be updated to point to the object's new address. And this is why the Stop the World happens: we can't be changing these things while everything else keeps moving (see the section at the end about Azul's Zing JVM).

Note: this pointer fixing can be really taxing on the processor's L2/L3 cache lines, depending on how spread out in memory the pointers to be fixed are and how many times the object is pointed to, so a heavily linked structure (e.g. a highly connected graph) will probably slow down the process.

From all this, it follows that the amount of time it takes to do a Full GC largely depends on the number of objects still alive that must be copied, which in turn depends on the application logic and the amount of memory to scan (AKA: the Old Gen memory size).

Some Caveats


It is common practice to set the memory sizes by specifying the same value to two parameters. The JVM accepts a Max size and a Start size. If you set Start to less than Max, then the JVM allocates the Start size and grows when needed. This growing (and shrinking) is expensive, so most people set Max and Start to the same value. For example:

  • the Perm size can be set to 512Mb with -XX:PermSize=512m -XX:MaxPermSize=512m
  • the total heap size can be set to 8Gb with -Xms8000m -Xmx8000m
  • the New size can be set to 128Mb with -XX:NewSize=128m -XX:MaxNewSize=128m
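
Putting the three settings above together, a start-up line might look like this (the sizes are only illustrative and MyServer is a made-up main class):
    java -Xms8000m -Xmx8000m -XX:NewSize=128m -XX:MaxNewSize=128m -XX:PermSize=512m -XX:MaxPermSize=512m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC MyServer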

How to get information on Garbage Collection


In order to know what is going on at the GC level, you can add the following parameters to your JVM on startup:
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -Xloggc:<file path>
Also, you can use the jstat command to see what's going on in real time on a running JVM (for remote JVMs it requires jstatd and probably some configuration settings). The command takes the form:
jstat -gcutil <pid> 1s 1000
where 1s means 1-second intervals and 1000 is the number of samples to take. The full jstat command line arguments for Java 5 can be found in this link.

This command is interesting because it shows memory usage in percentages, and also the number of Minor and Full GCs performed on the system together with the total accumulated time spent in each type of GC. You can then use these numbers to figure out how long it takes to perform each type of Garbage Collection.
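
For example (the numbers are made up): if jstat reports FGC=12 and FGCT=30.0, the JVM has spent 30 seconds in 12 Full GCs since startup, so each Full Stop the World pause is costing roughly 2.5 seconds on average. The same division with YGC and YGCT gives the average Minor GC pause.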

So, how much memory does a JVM need?


This actually depends, as it always does. There is, of course, a minimum below which your application will not even start. Obviously, you need more than that minimum.

Now, if you are a lucky bastard, then you can probably get away with having NO Full Stop the World GC. Yes, I said before that you can count on death, taxes and a Full GC. So, how is it possible to get away with no Full GC?

Well, it has to do with the definition of lucky bastard. You fall into that category if you can manage to satisfy these two conditions:

  1. have an application activity cycle that has a very low activity period. For example: you serve customers in just one country and you have near zero activity at 3 am.
  2. have an application that, measured between two adjacent low activity cycles and without a Full GC, consumes less memory than your server's real memory. For example: between today at 3 am and tomorrow at 3 am, your app requires 4 GB of total memory and your server has 8 GB of physical memory.
Update: I will clarify rule 2. Let's say you realize the JVM executes a Full GC every hour and that each run releases 1Gb of OldGen memory. Then, over a 24 hour cycle, that JVM will require 24Gb of memory, plus the minimum required memory, plus some extra memory for safety to avoid executing a Full GC. If your server has 48Gb of memory, then allocating, say, 32Gb to the JVM should do the trick.

In other words, lucky means that you can allocate 6 GB of memory to your JVM (which avoids a Full GC during working hours) and you also restart your app at 3 am, before the Old Gen gets full.

Yes! You get away without a Full GC because you kill your JVM before it actually needs one. This approach is used by many financial institutions (especially high frequency traders) to avoid the latency of a Stop the World GC.

Now, if you are an average mortal, your JVM will actually stop to GC at some point or another. So what size should you use for the Old Gen? It depends on your processing requirements and requires trial and error. What you must embrace is the fact that the Full GC cannot be avoided. What you can do, however, is control how long the Full GC stops the world. As said, that depends on the total number of objects, which in turn depends on the total memory that the GC must scan and copy.

So the value that you set can actually be tuned to your needs, but the unexpected thing is that the memory size might have to go down for you to hit your target SLA requirements (e.g. all requests must be answered in less than a second).

I know this is counterintuitive, and frightening. A few days ago, I was talking to a customer and said that the total memory configured for one JVM should probably go down from 4.5Gb to 3Gb to reduce the Full GC stop time from 2.5 seconds to less than 1s. A few eyebrows were raised and some fearful looks crossed the room.

Nobody is happy with the idea of reducing the memory allocated to a process, especially in these days of Moore's law, when cheap translates into a default of more is better.

But the GC algorithm is roughly linear in time with the memory size, and perhaps it's better to have frequent, shorter stops than less frequent but longer ones. Your SLA might call for this.

You see, sometimes less is better.

So, what is your target Memory value?


Well, it's trial and error, but now you know how to tune the value. You check how frequent the Full GC is and how long it takes, and figure out if you are a lucky bastard. If you are, then you are all set.

If you are not, then the duration of the Full GC should tell you whether you have to add or remove Old Gen memory, and then you tune the size slowly until you get the expected results.

There is also some good news for those lucky bastard wannabes. If you are elastic in the number of CPUs/cores/memory available, then you can get away with no Full GCs, even if your app has significant activity all day long. All you have to do is monitor when your JVMs are approaching the Full GC mark and restart them. An Old Gen at 95% utilization can be a good point at which to recycle the JVM, but this also needs tuning.
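
A minimal sketch of such a monitor, using the standard java.lang.management API (the 95% threshold and the pool name matching are assumptions you would adapt to your collector and your needs):
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class OldGenWatcher {

        // Returns true when the Old Gen pool is above the given utilization (0.0 to 1.0).
        public static boolean oldGenAbove(double threshold) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // CMS names this pool "CMS Old Gen"; other collectors use
                // "Tenured Gen" or "PS Old Gen", hence the loose matching.
                String name = pool.getName();
                if (name.contains("Old Gen") || name.contains("Tenured")) {
                    MemoryUsage usage = pool.getUsage();
                    return (double) usage.getUsed() / usage.getMax() > threshold;
                }
            }
            return false;  // pool not found: better not to trigger a recycle
        }

        public static void main(String[] args) {
            if (oldGenAbove(0.95)) {
                System.out.println("Old Gen above 95%, time to recycle this JVM");
            }
        }
    }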

Whether you are tuning the NewMem size or the OldGen size, the two most important things to remember are:

  • that the value you set has a direct impact on the duration of the GC cycle, and
  • that the type of garbage created depends on your application, so the values for one app will not necessarily work for a different app.

What about Application Logic?


As I mentioned before, two factors affect the Full GC time. The first one is the memory size, which we just covered.

The second factor is Application Logic. More precisely, the reference structure and the pattern of change of Long Lived objects.

The more intricate the links between objects, the more expensive it becomes to move an object around in memory, due to the pointer fixing. So you probably don't want a big mesh of objects referring to each other. A simple reference structure is probably the best choice here, unless you absolutely need something more complex.

The other issue to control is the pattern of change. It is generally best if your long term objects (e.g. cached options and configuration) are mostly stable over long periods of time. The reason is that a destroyed long term object leaves a hole in memory that must be reused, which in turn requires moving other long term objects around.

Are we stuck with Full GC?


There is also a bit more good news, if you can afford to spend some green currency bills.

There is a company called Azul Systems that created a special brand of JVM called Zing. One of the biggest features of Zing is that it has a GC that makes no Full Stop the World GCs.

I haven't used it and I have no relation with Azul Systems, so I can't really say much more about it, but it sure sounds cool.

Some extra info


If you want to know more about Garbage Collectors, you can watch this InfoQ presentation by Gil Tene (CTO and co-founder of Azul Systems). It's about Garbage Collectors in general. Again, I have no relation to Azul Systems in any way, but the information in this presentation is very good.

This is enough to be my thirteenth post.

2012-08-27

Harder to kill than cockroaches

About 300 million years ago, cockroaches appeared on Earth as a life form. I'm sure the first ones were very primitive (as it is with every life form), but at some point one specific kind of cockroach evolved, and it took over the Earth, because it was the most advanced cockroach that ever existed! This one was extremely resilient to almost everything (even radiation) and will probably outlive us all. Some variations of this insect evolved later (there are 4500 sub-species, 30 of which are related to human habitats, with specific variations by continent).

In 1990, the first web browser came to exist. It was called "WorldWideWeb". I'm sure it was primitive (compared with what we have today) and it spawned a whole lot of new experiments. Soon we had subspecies like NCSA Mosaic (and I'm old enough to have used it), then Netscape (which I've also used), Internet Explorer, Firefox, Safari, Chrome, Opera and many, many others. Not sure there are thousands of browser subspecies, but I would say there are at least a hundred, easily.

You probably see where this is going, but in August 2001, the most advanced species of browser appeared on Earth, and it took over the planet. It was Internet Explorer 6. There existed many subspecies (different service pack levels, different languages, different configurations and enabled options). Some of these subspecies were geography specific. For example, I understand that South Korea has a government mandated authentication protocol that was supported only by IE 6. There is also the Chinese version of IE 6, which is used in... China.

IE 6 inherited an adaptation from its previous, more primitive sibling: AJAX. It carried it hidden in its DNA until it became a real evolutionary advantage, circa 2005, when evolution in behavior began, in the form of innovation. And it was a boom. It learned so many new tricks in a few years that it changed the environment forever, but it was not very well suited to this new environment. Evolution took over again, with Firefox, IE7, Chrome, etc.

When it first appeared, it was the most advanced on Earth... exactly 11 years ago, on August 27, 2001.

But 11 years in the computer industry is like some number of millions of years in evolution, and the damn thing is still lingering on, like a cockroach that refuses to die. When support was finally discontinued, Microsoft itself declared that IE 6 is a fossil, but it's a living fossil.

Not something of the past. A living fossil! Something you would expect in a Hollywood movie like Journey to the Center of the Earth. Or maybe in a Steven Spielberg movie where some scientist recovers a portion of bits from an old damaged CD-ROM and recreates it using some bits from another application.

In an effort to eradicate the thing, Microsoft created the IE6 Countdown site, to track where on the planet this species is still a pest, carrying deadly diseases to developers, businesses and users alike.

Still 6% presence, mostly in China, where it has 21% market share! I have been tracking the evolution of the fight and I'm happy to say that from June 2012 to July 2012, Argentina (where I live) crossed the barrier from 1.2% market share to less than 1% (0.9%, to be precise). Argentina is now marked green on the map!

But it is a die-hard. Even Norway (which crossed the 1% barrier long ago) still can't get rid of it completely, with 0.1% market share.

So, what can you do? When you see a cockroach in your kitchen, don't you kill it? Do the same thing with IE6. When you see someone using it, kill it. Go to the web, search for another (any) browser, download it, install it, and feel good that you did a good thing for that someone.

IE 6 did good. It changed the environment into something more hospitable. It enabled an explosion. But it is now past its extinction time.

This is enough to be my twelfth post.

2012-03-30

DB Connection pool losing connections (AKA: the not so transparent firewall)

I'm posting this because I've seen this problem many times, especially in corporate environments.

The symptom

You have an application (usually a Java app running on an app container such as Tomcat) that connects to a (any) database. You change from using a simple "connect to the db when you need it" to "use a connection pool to eliminate connection setup time".

You immediately roll back the connection pool change because the connections seem to die after some time (usually during low usage or night time), making your application unresponsive.

The explanation

Well, there are many possible causes for the symptom, but one of the most common I've seen (especially in corporate environments) is the widespread use of transparent firewalls.

As a side note: be wary of anything that calls itself a transparent something, such as transparent firewalls or transparent proxies because they are, at best, translucent.

To understand the problem, we must first understand that the TCP stack closes a connection after it exchanges a number of packets with its peer TCP layer. The message flow is complex because a TCP connection can be half closed, meaning one of the peers is done with it but the other is still sending. In order to close a connection, one of the peers can send a message with the FIN flag set (which triggers the FIN-WAIT statuses you can see with the netstat command). The FIN-WAIT-1 status then progresses to FIN-WAIT-2, then to TIME-WAIT and finally to CLOSED.

The other way to close a connection is for one peer to send a TCP RST (reset) packet, indicating that the whole connection is in failure mode and shall be terminated.

TCP keep-alive was invented so that long lived connections could be kept alive. It involves exchanging ACK (acknowledge) packets between peers every now and then (the interval is usually configurable at the kernel level).
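
On the JVM side, a program has to opt into this kernel-level mechanism explicitly; a minimal sketch (the host, port and timeout are made up):
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class KeepAliveClient {
        public static void main(String[] args) throws Exception {
            Socket socket = new Socket();
            // Ask the kernel to send TCP keep-alive probes on this connection.
            // How often the probes are sent is configured at the OS level, not here.
            socket.setKeepAlive(true);
            socket.connect(new InetSocketAddress("db.example.com", 5432), 5000);
            System.out.println("connected, keep-alive enabled: " + socket.getKeepAlive());
            socket.close();
        }
    }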

When keeping a connection alive, if a host doesn't hear from its peer, a lengthy series of attempts to reach the peer begins (we'll come back to this later). Upon failure, this series ends with the connection in the TIME-WAIT status, which occupies a connection table slot for quite some time.

Enter the corporate firewall.

The most common usage of a corporate firewall is to do connection filtering. This means: specifying which host can connect to which other host+port. I'm not really sure why a firewall is used for this, because most basic routers (and even switches) can do this kind of thing, albeit more in a NAT kind of way, not in an IP-hiding way.

Connection filtering works something like this:

  • When a connection is attempted, the firewall looks at the filtering rules to check if the 4-tuple is in the list of allowed connections.
  • If it's not on the list, then one of the following happens:
    • an RST packet is sent to the source host, blocking the establishment of the connection. Not the best option, as it informs the source host that the remote IP does exist but doesn't like the port.
    • nothing is sent, letting the remote host retry until it times out. This is good, as the source host does not learn that the remote IP is valid, preventing address/port network scans.
  • If the 4-tuple is on the list, then both of the following happen:
    • An entry representing the 4-tuple is created in an in-memory table inside the firewall software.
    • The connection setup packet is forwarded to the destination host (oftentimes with some IP address rewriting).

So far so good.

Now, for a reason that escapes me, many (most?) corporate firewalls don't like low traffic, long lived TCP connections. They dislike these connections badly enough to go to the extreme of detecting TCP keep-alives in order to be able to euthanize them.

Low traffic, long lived connections in corporate environments are Darwinianly selected out by corporate firewalls. This natural selection is what's causing trouble in your environment.

The real problem is that the connections are not being terminated, just euthanized. Yes, there is a big difference.

In a connection termination scenario, the firewall has to do two things:

  1. Send each real peer a TCP RST packet. This gives the peers a notification of the massacre. This notification lets each host's kernel return an error to the application, which will notice the connection is closed and be able to reconnect.
  2. Drop the 4-tuple from the firewall's valid connections table (thus blocking all subsequent traffic).

But the connection is euthanized. This means the firewall usually performs just step number 2.

The peers, not knowing about the transparent firewall, enter the lengthy process of trying to reach each other. A process that is doomed to fail, because the firewall is no longer forwarding the traffic.

Eventually, the (kernel) peers decide to give up and close the connection, letting the application know with an error. The key word in the previous sentence being eventually. In the meantime, the connections are zombies inside the kernel and the application is not notified until it's too late.

The Solution

The good news is that you can probably fix this problem without trying to get the people running the firewalls to understand that they are causing trouble and change the defaults (after all, this is a silo'd corporate environment, what are your chances of effecting change on a different department?).

If you are using a standard connection pool, there are two options you have to set to avoid triggering the euthanasia:


  • Set the timeout (or keep alive) connection time to a lower value than the firewall's idle value. If the guys running the firewall don't tell you what this value is, you can always test it by hand.
  • Set the option to do an active keep-alive. This effectively sends a command to the database, causing real traffic to go through the connection. The command must be free of side effects (such as select 1 from dual, select current_time or something like that). A configuration sketch follows below.
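
A minimal sketch of both options using Commons DBCP (the JDBC URL, credentials and timing values are made up, and other pools such as c3p0 or the Tomcat pool expose equivalent knobs):
    import org.apache.commons.dbcp.BasicDataSource;

    public class PoolConfig {

        public static BasicDataSource createPool() {
            BasicDataSource ds = new BasicDataSource();
            ds.setUrl("jdbc:oracle:thin:@dbhost:1521:ORCL");   // made-up connection data
            ds.setUsername("app");
            ds.setPassword("secret");

            // Option 1: evict connections that sit idle longer than the firewall
            // tolerates (the 10 minute value here is an assumption, measure yours).
            ds.setMinEvictableIdleTimeMillis(10 * 60 * 1000);

            // Option 2: active keep-alive with a side-effect free query, run
            // periodically on idle connections so real traffic goes through them.
            ds.setValidationQuery("select 1 from dual");
            ds.setTestWhileIdle(true);
            ds.setTimeBetweenEvictionRunsMillis(5 * 60 * 1000);

            // Also validate a connection right before handing it to the application.
            ds.setTestOnBorrow(true);
            return ds;
        }
    }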

If you are not using a connection pool but an ad-hoc, application-level protocol, you can always create a nop (no operation) command that returns some data.


This is enough to be my eleventh post.

2012-03-08

Will extreme poverty ever end? Nuru International

Hi,

after watching a Google Tech Talk by Jake Harriman, founder and CEO of Nuru International (nuruinternational.org), titled Will Extreme Poverty Ever End?, three things got into my mind.

The very first one was to actually go and make a monthly donation. Go ahead and do so yourself. Not only will you never die of a severe case of donation, it also increases your overall happiness, which reduces stress. So go and do something good for others and yourself (or yourself and others).

The second thing I couldn't stop doing was wondering about this guy. A US Marine who went to Afghanistan (something that some people, including me, consider to be one of those bad things the US does abroad). This guy decided to keep fighting terrorism on very different front lines: those of NGOs and humanitarian relief.

I can't help thinking that when a soldier decides that she or he will be more effective fighting terrorism by eradicating poverty, then a lot of good things are going on in the world. Many, many more good things than we would assume if we just listen to mainstream news organizations, or to old folks saying things used to be better.

While I never had the paralysis problem when facing a decision to let go of things and start anew, it is very telling that changes like the one Jake Harriman made are not only possible, but feasible.

Bold move.

The third thing that got into my mind is innovation. You can tell that this guy took a lot of baggage from the Marines (well, you can't stop being who you are) and applied it to the field of humanitarian relief. Nuru works like a contractor. They don't actually try to reinvent the wheel, but to figure out the most effective wheel for each type of task. Nuru partners with other NGOs so that both NGOs can better fulfill their respective missions. And Nuru is trying to actually measure things to figure out what works and what doesn't, to try new things and fail fast.

A very agile style.

I couldn't help comparing this approach with the Record Labels' and the Entertainment Industry's. We were recently bombarded with the whole SOPA/PIPA debate (even here in Argentina it was all over the news). These industries don't try new things. They don't even like others to try new things. In the case of Hollywood, they even disregard their own history. Independent studios relocated to the US west coast because the many patents and laws on the east coast were stifling innovation and new business models. And those independent studios grew to be Hollywood.

So, instead of collaborating with Napster, the Labels shut it down. Neither got the chance to better fulfill their respective missions.

Nothing like contrast.

This is enough to be my tenth post.