2009-12-24

The flavors of Java concurrency control

This second post took longer than I expected. I was caught up in a witch hunt at my day job for a few weeks, and then everything else started piling up. The witch hunt had me reviewing a lot of Java code running inside an ESB, looking for any kind of performance hog I could spot with the naked eye. Yes, I know there are profilers, but good luck trying to run one of those things in a bank's production environment.

During the review I found a few possible enhancements; the most important of them is what started this post.

I will discuss two methods of concurrency control available as of Java 1.5, the synchronized modifier and the classes in the java.util.concurrent.atomic package, and then draw some conclusions. I will not use Java-specific objects when trying to explain a language-neutral feature (such lines are marked with a comment containing the word "LN").

The synchronized modifier

This modifier provides a no-brainer way to implement mutual exclusion: you prepend the modifier to a method and that method becomes guarded by a mutex. What comes to mind when we read a sentence like that is that, for the following code:

int someValue = 0;

synchronized int getNextValue() {
    return(someValue++);
}

the compiler will generate code looking like:

int someValue = 0;
Mutex getNextValue_mutex = new Mutex();

int getNextValue() {
    getNextValue_mutex.getLock();       // LN
    int tempValue = someValue++;
    getNextValue_mutex.releaseLock();   // LN
    return(tempValue);
};

It is reasonable to expect that Mutex.getLock allows only one thread to get the lock while queueing the other threads in FIFO order, and that Mutex.releaseLock takes the first queued thread (if any) and schedules it for execution.
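As an aside, the hypothetical Mutex above has a real-world counterpart in java.util.concurrent.locks.ReentrantLock; a minimal sketch of the same counter written with it (the Counter class name is just for illustration, and this is not what the compiler actually emits) would be:

import java.util.concurrent.locks.ReentrantLock;

class Counter {
    private int someValue = 0;
    private final ReentrantLock getNextValue_mutex = new ReentrantLock();

    int getNextValue() {
        getNextValue_mutex.lock();          // block here until the lock is ours
        try {
            return(someValue++);
        } finally {
            getNextValue_mutex.unlock();    // always release, even on exceptions
        }
    }
}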

And that is what the Java synchronized modifier does. Well, not quite... it turns out this modifier does a little more than just guard the method with a mutex (see IBM's article "Threading lightly, Part 1: Synchronization is not the enemy").

The truth is that, in effect, Java's synchronized modifier flushes the processor's data cache on entry and commits it to memory on exit. So the code that really gets executed looks more like:

int someValue = 0;
Mutex getNextValue_mutex = new Mutex();

int getNextValue() {
    getNextValue_mutex.getLock();       // LN
    Native.processorDataCacheFlush();   // LN
    int tempValue = someValue++;
    Native.processorDataCacheCommit();  // LN
    getNextValue_mutex.releaseLock();   // LN
    return(tempValue);
};

Sidenote: the Native class stands in for some low-level methods that handle the processor cache (written in C or assembly and completely platform-dependent). Also, the whole cache need not be discarded, but figuring out what can be kept is not an easy problem to solve.

Two things to notice here:

  • If you think about it, the synchronized modifier will only work if the processor's data cache is handled this way, and
  • I suspect that the equivalent synchronization construct in Microsoft's .NET works the same way.

The data cache must be flushed on entry because you really don't know the status of whatever data the processor has already cached. It is possible that, on entry to the method, the cached data was fetched before another processor executed the same method, rendering it stale.
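If you want to see why that matters, here is a minimal sketch of the visibility problem (the StaleFlag class and its methods are made up for illustration): without synchronized (or volatile), the reader thread may keep using its cached copy of the flag and never notice the write.

class StaleFlag {
    private boolean stop = false;   // neither volatile nor guarded by synchronized

    void runWorker() {
        // The JIT/CPU is free to keep 'stop' in a register or cache line,
        // so this loop may never observe the write made by another thread.
        while (!stop) {
            // do some work
        }
    }

    void requestStop() {
        stop = true;    // this write may sit in the writer's cache for a long time
    }
}

Marking both methods synchronized (or the field volatile) is what forces the two threads to agree on the value.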

So yes, you must do that nasty thing to the processor's cache. Nasty is the right word here: just consider that Intel, AMD and every other manufacturer devote a very large number of transistors on every chip to data caches and predictive data pre-fetching.

There is a very important reason for this expenditure of transistors. Your GB-sized main memory is slow compared to the motherboard's bus frequency, which in turn is very slow compared to the CPU's clock frequency. For example, DDR2 SDRAM has its own clock running at half the motherboard's bus clock frequency, which in turn runs at a fraction of the CPU's clock frequency. A common set of values is 266 MHz / 533 MHz / 2.1 GHz (memory / bus / CPU).

To keep the CPU busy, the pre-fetch logic scans ahead in the instruction stream to make sure the data it will need from memory is at hand when the processor gets there. Flushing the processor's data cache throws away all the good work the pre-fetch logic has been doing, meaning the CPU has to sit idle until the pre-fetcher gets a chance to start refilling the caches.

This has a terrible effect on performance. Even assuming the cache flush affects only your program's entries, your other threads suffer from it. Imagine the effect if the flush is not selective and also discards cached data belonging to the operating system, services, etc.

There are two more things to ponder. The first one is that this cache flushing and committing should only matter on multiprocessor systems (like your servers). That is, up until now, in a single multi-core processor system all cores share the same cache, though I think the abundance of transistors is changing, or will soon change, that. I have not been able to confirm what the JVM does in these cases (i.e., whether it really optimizes the single-processor case by skipping the cache dumping). But I don't think that's very important... how many people buy a single-processor server these days?

The second thing to consider is that Intel and AMD both talk about cache coherency and invalidation protocols for inter-processor communication. The idea is that if one processor changes a memory location, it broadcasts that fact to the other processors, so rather than invalidating their whole caches, the other processors can just drop the outdated cache entries. I think some Intel processors can already do that, but I'm not sure.

To recap, the synchronized modifier uses a mutex that provides MUTual EXclusion and queues the threads waiting for the lock (strictly speaking, the JVM does not promise FIFO ordering for intrinsic locks; the fair mode of java.util.concurrent.locks.ReentrantLock does). This also requires processor data cache flushes and commits and, depending on the platform you are running on, expensive context switches to kernel mode (to operate on the mutex itself). Threads are de-scheduled when they can't immediately get the lock, so they don't consume CPU while unable to proceed.

The atomic package

The atomic package (java.util.concurrent.atomic) was added in Java 5. It provides a set of thread-safe, atomically readable and writable equivalents of scalar values (boolean, integer, long, references and arrays of these). The big thing about these replacements is that they are lock-free (no mutexes) and still thread-safe.

The synchronized example we saw at the beginning of the previous section could be written as:

AtomicInteger someValue = new AtomicInteger();

int getNextValue() {
    return(someValue.getAndIncrement());
}

Notice that the synchronized keyword disappeared and, the class being thread-safe, no thread will ever get a partially updated value.

How is this possible? Thanks to a technique called a Spin Lock. A Spin Lock is a method that, AFAIK, was first devised for synchronization inside operating systems in the face of multiple CPUs. If you took something resembling an Operating Systems course, you'll remember the Test and Set synchronization primitive (if you don't, you can read about it on Wikipedia). A Spin Lock is a loop around a Test and Set operation: the program keeps looping (consuming CPU) until the lock is obtained. If you want to know more about Spin Locks (and many other synchronization options), check reference [1].
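To make the loop concrete, this is roughly what AtomicInteger.getAndIncrement does internally, sketched here with the public compareAndSet method (the actual JDK code does the same thing through lower-level internals):

int getAndIncrementSketch(AtomicInteger value) {
    for (;;) {
        int current = value.get();      // read the current value
        int next = current + 1;
        // compareAndSet succeeds only if nobody changed the value since we read it;
        // otherwise we loop ("spin") and try again.
        if (value.compareAndSet(current, next)) {
            return(current);
        }
    }
}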

On Intel architectures there is a single processor instruction that exchanges two data values. One variant of it (taking a register and a memory address) can be used to implement an atomic exchange or, put another way, a Test and Set (the atomic classes actually rely on a related compare-and-exchange instruction, but the idea is the same). The instruction is atomic because it automatically places a low-level lock on the motherboard's bus, so no other component can use the bus while the lock is held (and as memory is connected to the bus, only the lock holder can access it).

Sidenote: this instruction can be traced back, on Intel architectures, to the 8086 and 8088 processors, where it required a special prefix to place a lock on the bus. In case you are wondering, the answer is: yes! As far back as 1978, Intel processors have been multiprocessing-ready.

Now, what about the processor's data cache and the exchange instruction? Well, as you might have guessed, for this to work the Test and Set must be cache-ignorant. That is, the pre-fetch logic has to stay out of this particular instruction's way, because pre-fetched data could be altered between fetch and use. It is also slow compared to other instructions, because it has to access memory right on the spot. "Slow" here means the whole system is penalized by a few bus clock cycles. The reality is of course a bit more complicated (dual-ported memories, multiple buses, etc.), but you get the picture.
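Putting the pieces together, a bare-bones Spin Lock is easy to sketch in plain Java on top of AtomicBoolean, whose compareAndSet maps to exactly this kind of locked instruction (the SpinLock class is hypothetical, not part of the JDK):

import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Test and Set in a loop: keep trying until we flip false -> true.
        while (!locked.compareAndSet(false, true)) {
            // busy wait: burning CPU, but no context switch, no kernel call
        }
    }

    void unlock() {
        locked.set(false);      // let the next spinning thread win the race
    }
}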

There is a sister package, java.util.concurrent, that provides data structures (ConcurrentHashMap, CopyOnWriteArrayList, thread pools, etc.) which use this kind of lock-free mechanism to allow concurrent reads and writes.
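As a quick example of those structures (the HitRegistry class and recordFirstHit method are made-up names), a ConcurrentHashMap lets many threads read and write without taking one global lock:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class HitRegistry {
    private final ConcurrentMap<String, Integer> hits =
            new ConcurrentHashMap<String, Integer>();

    // Safe to call from many threads at once: putIfAbsent is atomic,
    // so only one thread wins the race to insert the initial value.
    void recordFirstHit(String page) {
        hits.putIfAbsent(page, Integer.valueOf(0));
    }
}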

To recap, the Spin Lock method is always implemented at the process level (it does not require the operating system's help). It does not require the processor's data cache to be flushed, but it does slow down the whole system by a few clock cycles because of the cache bypassing. Threads are not de-scheduled when they can't get the lock; they keep looping (consuming CPU) until they get it or their CPU time slice is used up. This makes it impossible to guarantee that threads get the lock in any particular order. It also means that whatever you guard with a Spin Lock should be really fast and non-blocking (e.g. no file or network I/O).

Why are there two methods?

Well, that's easy: they are two different tools, used for two similar yet different tasks in different scenarios. You just have to know when to use which.

The definitive answer depends, of course, on the full description of the problem you are solving, but the following list of questions is more or less my recipe for deciding:

  • Do you need strict FIFO access to the shared resource? If yes, use the blocking approach, and keep in mind that plain synchronized does not actually promise FIFO ordering; a fair ReentrantLock does (see the sketch after this list). Example: if eBay receives two bids for the same price, it wants the first incoming one to win.
  • Is the code to be executed fast and small, with no I/O or other blocking operations? If yes, then you can go with the Spin Lock. Remember that a Spin Lock keeps consuming CPU while the lock is not acquired.
  • Are you optimistic or pessimistic about the amount of contention on the resource? If you are pessimistic, go with the synchronized method; if you are optimistic, go with the Spin Lock variant. Optimistic here means you expect low contention on the resource, so the busy loop of the Spin Lock will outperform the context switch (for the mutex) and thread de-scheduling of the synchronized case.
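About the FIFO point above: strictly speaking, the ordering guarantee has to come from a fair lock rather than from plain synchronized. A minimal sketch (bidLock is just an illustrative name):

import java.util.concurrent.locks.ReentrantLock;

// 'true' requests a fair lock: waiting threads acquire it roughly in arrival order.
// Plain synchronized blocks make no such ordering promise.
ReentrantLock bidLock = new ReentrantLock(true);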

Besides that, I think it's safe to assume that if you have a data structure that doesn't change too often and you want the maximum possible concurrency, the concurrent data structures provided by java.util.concurrent are in general a good option, but...

In a congested (CPU-bound) system, the Spin Lock's busy loop can make the congestion even worse by wasting more CPU. Ideally a Spin Lock should try for some amount of time and then yield the CPU to let the lock holder finish and release the lock, but under high CPU loads this can delay the thread beyond what is acceptable.
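That spin-then-back-off idea can be bolted onto the SpinLock sketch from the previous section (again hypothetical code, and the spin limit is an arbitrary number):

import java.util.concurrent.atomic.AtomicBoolean;

class BackoffSpinLock {
    private static final int MAX_SPINS = 1000;   // arbitrary tuning knob
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        int spins = 0;
        while (!locked.compareAndSet(false, true)) {
            if (++spins >= MAX_SPINS) {
                Thread.yield();     // step aside so the lock holder can run and release
                spins = 0;
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}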

Conclusion

You should keep these two concurrency control options at hand in your tool-belt, but you should always start with the most important tool at your disposal (AKA your brain) and not get fooled.

One last example. A few months back, a different system was suffering performance issues. One of the problems turned out to be that a one-way (audit and trace) message was put on a queue to be processed in the background. The decision to do this was sound, because the extra processing could be done asynchronously on a different machine, making the foreground task faster without sacrificing the audit functionality.

How was that a performance problem? Well, queues are complex data structures and, as such, require the put and get operations to be synchronized (locking the queue). This case was even worse, as the queue was part of a clustered ESB with multiple producers and consumers on different machines.

While the system was under average load everything went peachy, but when the load started to peak, the overhead caused by the locks on the queue's put/get operations became noticeable, up to the point where performance took a hit.

Synchronization is all around you, even if you don't see it.

Enough for a second post.

UPDATE: many thanks to Alejo Sanchez for reviewing early drafts of this post.

References

[1] John M. Mellor-Crummey and Michael L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," ACM Transactions on Computer Systems (TOCS), Volume 9, Issue 1 (Feb. 1991), pages 21-65. You can download it for free from CiteSeer at Penn State University.

2009-11-08

printf("Hello World!\n");

Not much else to say, is there?


I mean, if you are a programmer, then you get the thing. If not, you are probably in the wrong place.


Enough for a first post.