*NULL Pointer Assign

Time flies!

2013-12-31T18:59:00.002-03:00

Another year... It seems like i've just blinked.

I hope you have a very good new year.

Have you made any donations this year?

For my part, i continued my monthly donations to the WFP (held by United Nations, www.wfp.org) and to Nuru International (www.nuruinternational.org). The first program is to bring relief, the second with the intention to really solve the problem. Let's hope.

As for my end-of-year donations, this time i donated to:

Wikimedia Foundation. Donation page in English and Spanish. They accept donations in ARS (Peso Argentino)~~, which is a good one to avoid the 35% extra charge for USD expenses~~. Update: the transaction actually proceeds in USD.
SETI Institute. Donation page.
Unicef Argentina. I made the donation thru my bank.

To my dismay, i tried to donate to Cruz Roja Argentina (Argentina's Red Cross chapter). If you can believe it, the donation page asks for your credit card information, but the page is not secure (plain old HTTP instead of HTTPS). This is a big WTF! I contacted them explaining the issue and requested a different means to donate. Let's see how this goes.
Update: they responded, indicating that they do care about that and that they use a secure payment processor (their page is still HTTP). I requested an account number to donate safely.

Well, i hope you have a happy new year, electric power if you live in Buenos Aires (another big WTF this year) and an open wallet to donate some money.

This is enough to be my seventeenth post.

Problems and Obstacles

2013-11-17T18:30:00.000-03:00

I was recently talking with colleagues, discussing an issue and, unexpectedly, i asked: "is this issue a problem or an obstacle?"

While i spoke the words, i remember thinking that i needed to devote some brain cycles to think about it. Hence this post.

The first thing to understand is we usually use the terms interchangeably, but they mean different things (see Wiktionary's definition for problem and obstacle).

A Problem is something active, that you interact with and that can even change you. An Obstacle is passive, something that you have to circumvent. Unnecessary.

Being a natural engineer of sorts, i like Problems. I don't very much like obstacles. But both require energy.

Many startups become successes by solving Problems, but probably many more thrive by solving one type of Obstacle for their customers (the startup's problem is then to efficiently and scalably solve said obstacle).

Normally, companies build Obstacles to enforce specific behaviours in their workers. Yet, the worst expression of Obstacle comes when companies turn these behaviour-enforcing Obstacles into cysts that plague work. Bureaucracies are just cysted piles of Obstacles.

As a worker, you normally should escalate Problems to your boss and handle Obstacles yourself, but this is not always possible. Sometimes your boss is an Obstacle on its own. Sometimes the company you work for loves to make Problems out of Obstacles, so that no employee can ever make any decision. Sometimes you are the Obstacle... or Problem.

Yes. We all sometimes create Obstacles and Problems for ourselves or others. We usually do so because of our personalities, fears or bad habits. The worst self-delusional use of Obstacles is to hide Problems (circumventing something is a nice way of not examining that something). Another annoying personality trait is to turn Obstacles into Problems (aka: the resident renegade).

For some reason that escapes me, Obstacles tend to rapidly become part of the Culture, while Problems tend to remain ignored. Yet, Problems usually hide a huge opportunity for economic, personal and/or social growth.

Finally, a life with only Obstacles makes you feel worthless; a life with only Problems makes you feel overwhelmed.

Learn to tell Problems from Obstacles. May your Obstacles be small and your Problems sparse. Fill the rest with satisfaction or happiness.

This is enough to be my sixteenth post.

Goodbye Fibertel, and Fuck You!

2013-02-27T15:58:00.001-03:00

Yes. After 4 years and a few days, I am switching providers. No more Fibertel. Now it's Telecentro. Why? Well, what follows explains the reason, but the most important part comes afterwards, because I think there are several lessons here that apply to business and should always be kept in mind.

Brief History

A few days before October 2, 2012 (almost 5 months ago), my Internet service died. The modem lights came and went, but the most important one (Cable) was off. My modem could not negotiate with its counterpart. I called 0-810-something and was assigned a technician visit for October 2. A nuisance. No Internet? I can live without gas (thank you, delivery), without TV (there isn't much to watch anyway), but not without Internet. Ugh. Fine.

Work order 6810763. The technician came and checked everything. Diagnosis: the problem is that not enough signal reaches the building's distribution box. They have to replace the cable. Uh, what? And how long does that take? No, they do it within 72 hours. And do I have to do anything? No, it's a building-wide issue, so the company arranges it. Ah, OK.

Two days later, the service was restored. Good!

And I lived happily, until December 28, 2012 (yes, exactly on Día de los Inocentes, our version of April Fools' Day). Damn it! (December 28, 2012 was a Friday, with the 31st and the 1st falling on Monday and Tuesday, both holidays). I knew what that meant: 1 week without Internet.

I called technical support. They gave me a visit for January 4. They didn't know whether it was something specific, a general issue in the area, or whether they were working somewhere on the network, so it might fix itself. On January 2, as if to start the year on the right foot, the thing had fixed itself. When the automated visit-confirmation service called, I cancelled the visit.

Around January 20 I had another outage (another Saturday), which lasted until Sunday.

The last act began on February 9. The service went away again, once more with a long weekend ahead (the 11th and 12th, Carnival holidays).

For technical support, press 1. Visit for February 12, work order 26849786. And this time, I thought, even if it fixes itself I will not cancel the visit. I told myself there had to be some problem, while I began to suspect that the October repair had never happened.

And so the technician came and checked everything. Diagnosis: the problem is that not enough signal reaches the building's distribution box. They have to replace the cable. Yes, the same diagnosis as in October. I knew it!

No longer so happy, but telling myself that everybody makes mistakes, I called technical support on Wednesday, February 13. Work order 35744811. They would come on Friday the 15th, between 8:00 and 18:00, to reinstall the riser cable (well, it's something they normally arrange with building superintendents, who are around all day). I made it clear that they should ring my apartment, since the superintendent was on vacation, but that I temporarily had the keys to the rooftop.

On Thursday the 14th, like clockwork, the automated system called to confirm the visit. "To confirm the visit, press 1". Clearly, I pressed 1.

Friday, February 15, 2013. The big day! I got up at 7:30 (unheard of for my night-owl habits). And I started reading while I waited. Without Internet you can't work and there isn't much to do. The idiot box has gone virtual ;-)

Around 13:30 I decided to go out for 20 minutes to buy some food so, cautious guy that I am, I called to ask whether they had any idea what time the technician would arrive, since I had to step out for a few minutes to run an errand.

No. The visit shows up as cancelled. Cancelled? Why cancelled? I confirmed it yesterday to the machine that called me (apparently they only care about their own time, not their customers'). Yes, I couldn't tell you why it is cancelled. No reason is shown. But I am filing your complaint and it has to be resolved within 72 hours, so they will call you about this. The complaint number is 36624236.

While the conversation went on, and no longer in such good spirits, I was thinking that one of three things was going on. These guys: 1) are not capable of solving the problem, 2) do not want to solve the problem, or 3) are taking me for a fool.

Resolution JLC-98902342: 72 hours to solve the problem; otherwise, resolve the matter personally (in plain words, Telecentro).

So I waited. A week with a holiday on Wednesday, but still 4 business days. More than enough time (96 hs > 72 hs). But they didn't call me...

Monday, February 25. The alarm went off. Up. To the bathroom. To the kitchen. To check email. No service.

Executing resolution JLC-98902342:

You have reached Telecentro's new customer service center.
You have reached Cablevisión - Fibertel. Press 1 if you wish... Hello, yes, I'm calling to cancel my Fibertel service. I'll transfer you to the specialized service center... Look, I understand you have to say this and that, but please don't beat around the bush; I already paid Telecentro's installation fee, so there is no turning back, just process the cancellation. Are you aware that you will have a lower quality of service? Uh, I prefer a lower quality of service with Telecentro over a service I don't have with Fibertel. Very well sir, I confirm the cancellation with case number 38087865. This is effective March 1st, right? Effective March 31st; you will be charged for the month of March. I thought for a moment... and said OK, with a certain enthusiasm.

February 27, 9:40. Good morning sir, I'm calling from Cablevisión's special cases center. I wanted to ask you the reason for the cancellation, blah blah. Yes, you people know the reason. Check the ticket history, and how you stood me up, and that I have no service. Thanks! Goodbye! And I hung up. Truth is, he didn't even ask whether I was able to talk at that moment. Right, because obviously I am at Fibertel's service.

Goodbye Fibertel, and FUCK YOU!!!!

Defensa del Consumidor, see you soon!

Defensa del Consumidor (the consumer protection agency)? Well, yes. I mean that OK with enthusiasm. It was the wrong "-asm"; it was more like sarcasm. So you not only want to provoke me, you also want to spit in my eye. Yeah, sure, one of these days!

Up to that moment, the whole affair was like a dead animal on the road. You drive past it and stop thinking about the poor critter flattened on the asphalt. It doesn't make much sense to go through life seeking revenge for every stupid thing that happens...

But paying for March for a service I am cancelling because I basically have no service? No fucking way! Because this business of paying for March is probably some internal administrative thing between Fibertel and its billing system.

And while we're at it, they are going to refund me the days I had no service. And they are going to pay me for the 5:30 hours they kept me waiting in vain on February 15.

And by the way, on February 11 I bought a 3G modem and loaded it with 60 pesos, because I had no service and I had to work. And on Monday the 25th I had to load another 50 pesos onto the blessed modem, to work. They are going to have to pay for that too.

I am fairly sure that to handle cases before Defensa del Consumidor they have to assign a lawyer, probably a junior one, at a cost of about $ 200 per hour. And I would also say that each of those cases requires between 10 and 15 hours of work, so I don't think they can get away with a cost below $ 2000 to $ 3000, just because I file a complaint.

And the thing is, at this point, besides being right to be angry and right in the complaint I am going to file, I also want a bit of Fibertel's blood on the table.

Old Klingon saying: revenge is a dish best served cold.

Why am I telling all this?

As I said at the beginning, several things in this story apply directly to how business is run, namely:

1- Patience has a limit

When a customer is giving you chances to do things right, it does not mean they will keep giving you chances forever. At some point you have to straighten up and solve the problem.

I know perfectly well that I will very likely get a lower quality of service from Telecentro. In fact, I think Fibertel is one of the best ISPs in Argentina.

Note: that odd feeling you had while reading that last sentence is Customer Loyalty burning and giving off a black smoke that pollutes the environment and damages the ozone layer.

Take care of your customers, understand what they need and find a way to give it to them (it doesn't always have to cost you something).

And "your customers" is used here in a broad sense. If you are an employee, your boss is your customer. Your boss also has to be happy (just as your boss has to keep you happy). Your market value depends on your boss continuing to give you business, better and different business. Your market value is your salary. Your outlook as a company is your career, and you cannot have a career if your bosses don't give you opportunities. That requires Customer Loyalty.

2- Customer Loyalty is a non-renewable resource

Creating and growing a business relationship (whether between companies, by contract, consulting or employment) takes a long time, but irreparable damage is done quickly.

Just like in this case. And if you read the little soap opera of a story carefully, you will have noticed that the turning point was February 15. That day, when they created a ticket and told me it had to be resolved within 72 hours and that they would call me. Right at that moment. They bet everything on a coin toss. Resolution JLC-98902342 is my subconscious perceiving that bet and accepting it.

They could have called me to say anything at all (e.g.: we were there, we rang the intercom and nobody answered; who knows, it could have malfunctioned). I would have understood anything.

But they didn't call. And they lost the bet. It happened in an instant.

3- Do the math right

Companies (especially large ones) normally classify their customers into different groups, ranging from lowest value to premium. That value (long term value) has to do with the profit the company can expect to obtain from that customer (and don't feel so offended, because you classify your friends this way too, and for some of them you would do things you wouldn't do for others).

In my specific case with Fibertel, I don't think I was in the super-hyper-top segment, but I am sure I was above average. With 6Mb Internet, DVR (it records), HD TV + HBO Pack and the occasional Pay Per View purchase. Quite far from the customer who only has Internet and who calls to complain and get a discount every time there is a price increase.

On top of that, with 4 years as a customer, the initial installation and DVR costs (which I imagine to be around U$S 400) had been more than amortized.

I estimate that Fibertel must have been making between $ 150 and $ 200 of net monthly profit from the services it billed me. And considering the type of customer I am (one who favors convenience even if it costs a bit more, which implies a low probability of switching providers), I figure the long term would be in the range of 5 to 10 years.

So (and this is another estimate) my long term value as a customer for Fibertel could be calculated as ($150..200) * 12 * (5..10 years), that is, between $ 9000 and $ 24000 over the coming years. It seems little, but multiply this by hundreds of thousands.
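As a rough check of that arithmetic, here is a minimal Python sketch. The function and variable names are made up for illustration, and every figure is one of the guesses used in this post (monthly profit, years as a customer, lawyer cost), not real Fibertel data.

```python
def long_term_value(monthly_profit, years):
    """Expected profit from one customer: monthly profit x 12 months x years."""
    return monthly_profit * 12 * years

# Estimates from this post (in pesos), not real company data.
ltv_low = long_term_value(150, 5)
ltv_high = long_term_value(200, 10)

# Profit over a single year, to compare with the cost of handling a complaint.
yearly_low = long_term_value(150, 1)
yearly_high = long_term_value(200, 1)

# Guessed complaint cost: a lawyer at $200 per hour for 10 to 15 hours.
complaint_low = 200 * 10
complaint_high = 200 * 15

print(f"Long term value: ${ltv_low} to ${ltv_high}")
print(f"One year of profit: ${yearly_low} to ${yearly_high}")
print(f"Complaint handling: ${complaint_low} to ${complaint_high}")
```

With these guesses, handling a single complaint costs roughly as much as a full year of profit from the customer.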

That is the profit Fibertel stops making from the simple fact that I switch to the competition.

But there is a bit more. The fact that I am going to file a complaint with Defensa del Consumidor has a cost for them, as I already mentioned.

So most likely, losing me as a customer also means losing the profit they made from me over the last year!

But wait! To make this offer even more unmissable: as I said, I think it's possible that at some point I will leave Telecentro and go back to Fibertel. But that also has a cost. It's called Customer Acquisition Cost, and the company pays it to you in the form of a promotion (or discount) for signing up. If I ever sign up with Fibertel again, the company will have to pay that cost too.

Pre-publication edit: you may think my numbers are wrong, and they probably are. Sometimes, before publishing a post, I send it to a friend to read it and give me an opinion. Well, in the 50 minutes it took to get that opinion, Fibertel's special cases center called me again. So yes, I think my numbers are too conservative.

So when you want to lose a customer (or better, to value a customer properly), do the math right!

Conclusion

It was good while it lasted.

This is enough to be my fifteenth post (and the first one written in Spanish, go figure!).

Another year is ending. Did you make a donation?

2012-12-24T20:13:00.000-03:00

Another year. Time goes by faster and faster.

I usually wrap up the year by making some donations.

This time i donated to:

Wikimedia Foundation. The donation page is here (English) or here (Spanish).
The SETI Institute. The donation page is here.

Also, all this year i've been making monthly donations to the World Food Program (English or Spanish) and Nuru International (here).

So, end this year doing something for others. That is money well spent and it actually makes you feel better about yourself.

This is enough to be my fourteenth post.

How much memory does a JVM need?

2012-12-24T19:58:00.001-03:00

Update: fixed some typos pointed out by Ezequiel and Lecko.

At this point in IT affairs, we should change the old US saying to go something like "Nothing is certain but death, taxes and a JVM Full GC".

Because, no matter what, one thing you can count on is that a JVM will eventually perform a Full Stop the World Garbage Collection. This is true with Concurrent Mark and Sweep (-XX:+UseParNewGC -XX:+UseConcMarkSweepGC), and with the relatively new G1 garbage collector on Java 6 (-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC).

Hence, the right thing to do is to give the JVM as much memory as you can afford. Right? Well, not necessarily.

Some ground leveling

What follows is in line with the old CMS (Concurrent Mark Sweep) GC because of the memory structure I will discuss. The G1, as far as I know, uses a somewhat different structure that is more resilient but, in the end, it will also perform a Full Stop the World Garbage Collection, just less frequently.

The standard memory arrangement for a JVM is to divide the memory in three. You have the New memory, the Tenured memory (OldGen) and the Perm (permanent) memory. The Permanent Memory is used to hold bytecode, class definitions and things like that.

The New Memory is itself divided in two: the Eden and the Survivor areas. Eden is where short lived objects are created (that is: when inside a method you create a string, an object is created in Eden and then it is destroyed when the method scope is destroyed). This Eden arrangement is what makes for a lightweight GC. How does it work? Well, if you come to think about it, when you have a server thread, you wait for requests and when you get one, you usually call a method to solve it. This method creates everything in Eden and when it terminates, almost every object residing in Eden has been destroyed (because of method scoping). So you can mostly reclaim all of Eden's memory without having to think about handling memory fragments. Empirically, most objects in a JVM have a very short lifespan (or high mortality rate, if you want) and Eden is arranged for that.

Yes, I said almost every object in Eden has been destroyed. That is what the Survivor area is for. Because objects that survive method scoping (ex: the method's result value) will be sent (copied) to Survivor. When Eden is full, then a Minor GC will occur. This Minor GC will do a few things:

reclaim all of Eden's dead objects,
move all of Eden's live objects to the Survivor area (by copying),
age the other Survivor objects (using generations, akin to being X years old), and
send the oldest generation of Survivor objects to the Old Generation (in Tenured memory) by copying

Note: there are actually two survivor areas, which are used alternately to make things more efficient.
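To make those steps concrete, here is a toy Python model of a Minor GC. It is only a sketch of the bookkeeping described above, not how HotSpot is implemented: the `ToyHeap` class, the object names and the `TENURING_THRESHOLD` value are all made up for illustration, and the two survivor areas are collapsed into a single dictionary.

```python
from dataclasses import dataclass, field

# Hypothetical value: number of Minor GCs an object survives before promotion.
TENURING_THRESHOLD = 3

@dataclass
class ToyHeap:
    eden: list = field(default_factory=list)      # names of newly created objects
    survivor: dict = field(default_factory=dict)  # name -> Minor GCs survived
    old_gen: list = field(default_factory=list)   # promoted (tenured) objects

    def minor_gc(self, reachable):
        """Run one Minor GC; `reachable` is the set of names still referenced."""
        aged = {}
        # Age the live Survivor objects and promote the oldest ones.
        for name, age in self.survivor.items():
            if name not in reachable:
                continue  # dead object: its memory is simply reclaimed
            if age + 1 >= TENURING_THRESHOLD:
                self.old_gen.append(name)
            else:
                aged[name] = age + 1
        # Copy Eden's live objects to Survivor; the rest of Eden is reclaimed.
        for name in self.eden:
            if name in reachable:
                aged[name] = 1
        self.survivor = aged
        self.eden = []

heap = ToyHeap()
heap.eden = ["request", "result", "cache_entry"]
heap.minor_gc(reachable={"result", "cache_entry"})  # "request" dies in Eden
heap.eden = ["tmp"]
heap.minor_gc(reachable={"cache_entry"})            # "result" dies in Survivor
heap.minor_gc(reachable={"cache_entry"})            # third survival: promotion
print(heap.survivor, heap.old_gen)
```

Only the long lived `cache_entry` ever reaches the Old Generation; everything else is reclaimed cheaply along the way.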

If you carefully think about what I've just said, you will realize that an object can be dead to the program (meaning: no longer accessible) but still be using memory and pretty much alive for all intents and purposes. Objects are not really destroyed until they are Garbage Collected. This is what the Mark and Sweep does: decide what objects are reachable (alive) and what objects are not (to destroy them). And yes, this is why you should really, really think twice before using finalizers, because the finalizer is called when garbage is being collected, not when the object gets out of scope. This keeps the resources in use and slows down the GC process because it cannot just kill the object; it has to give the finalizer a chance to run, thus the object survives the current GC cycle and memory is still occupied.

See the following chart to fully grasp what happens on a Minor GC:

The question is: what happens when Old Gen is full? Does the program halt? Enter the Full Stop the World Garbage Collection.

When this happens, the GC stops all processing in the JVM and starts collecting memory from zombie objects (alive but unreachable). This encompasses objects in the Old Gen, as well as objects in Eden, Survivor and things in Perm memory. The Perm is included because some loaded or dynamically created classes might no longer be needed and that memory could be reclaimed.

As for what goes on with the rest of the objects, well, they are marked as dead or alive and the live objects are memory copied to new positions in order to create a large block of unused memory. This copying is expensive on its own, but what makes the thing even worse is that every pointer (reference) to the moved object must also be updated to point to the object's new address. And this is the reason for the Stop the World: we can't be changing things while the rest keeps moving (see the section at the end about Azul's Zing JVM).

Note: this pointer fixing can be really taxing on the processor's L2/L3 cache lines, depending on how spread out in memory the pointers to be fixed are and how many times the object is pointed to, so a very linked structure (ex: a highly connected graph) will probably slow down the process.

From all this, it follows that the amount of time it takes to do a Full GC is largely dependent on the number of objects still alive that must be copied, which largely depends on the application logic and the amount of memory to scan (AKA: the Old Gen memory size).

Some Caveat

It is common practice to set the memory sizes by specifying the same value for two parameters. The JVM accepts a Max size and a Start size. If you set Start to less than Max, then the JVM allocates the Start size and grows when needed. This growing (and shrinking) is expensive, so most people set Max and Start to the same value. For example:

the Perm size can be set to 512Mb with -XX:PermSize=512m -XX:MaxPermSize=512m
the total heap size can be set to roughly 8Gb with -Xms8000m -Xmx8000m
the New size can be set to 128Mb with -XX:NewSize=128m -XX:MaxNewSize=128m

How to get information on Garbage Collection

In order to know what is going on at the GC level, you can add the following parameters to your JVM on startup:

-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -Xloggc:<file path>

Also, you can use the jstat command to see what's going on in real time on a running JVM (monitoring a JVM on a remote host requires jstatd and probably some configuration settings). The command takes the form:

jstat -gcutil <pid> 1s 1000

where 1s is the sampling interval (1 second) and 1000 is the number of samples to take. Full jstat command line arguments for Java 5 can be found in this link.

This command is interesting because it shows memory usage in percentages and also the number of Minor and Full GC performed on the system with the total accumulated time used for each type of GC. You can then use this to figure out how long it takes to perform each type of Garbage Collection.
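As a minimal sketch of that last calculation, the Python below parses one sample of `jstat -gcutil` output and derives the average time per Minor and Full GC. The sample numbers are made up, the helper names are hypothetical, and the column layout (S0, S1, E, O, P, YGC, YGCT, FGC, FGCT, GCT) is assumed to be the one printed by Java 5 to 7 JVMs; later versions use different columns.

```python
# Made-up sample in the layout printed by `jstat -gcutil` on Java 5 to 7 JVMs.
SAMPLE = """\
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  45.12  62.30  71.85  58.40    210    4.200     3    7.500   11.700
"""

def parse_gcutil(text):
    """Return the last sample of a jstat -gcutil output as {column: float}."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, last = rows[0], rows[-1]
    return dict(zip(header, map(float, last)))

def average_gc_seconds(stats):
    """Average duration of one Minor GC and one Full GC (None if none ran)."""
    minor = stats["YGCT"] / stats["YGC"] if stats["YGC"] else None
    full = stats["FGCT"] / stats["FGC"] if stats["FGC"] else None
    return minor, full

stats = parse_gcutil(SAMPLE)
minor_avg, full_avg = average_gc_seconds(stats)
print(f"Old Gen used: {stats['O']}%")
print(f"avg Minor GC: {minor_avg:.3f}s, avg Full GC: {full_avg:.3f}s")
```

YGC and FGC are event counts while YGCT and FGCT are accumulated seconds, so dividing one by the other gives the average stop per collection.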

So, how much memory does a JVM need?

This actually depends, as it always does. There is, of course, a minimum below which your application will not even start. Obviously, you need more than that minimum.

Now, if you are a lucky bastard, then you probably can get away with having NO Full Stop the World GC. Yes, I said before that you can count on death, taxes and a Full GC. So, how is it possible to get away with no Full GC?

Well, it has to do with the definition of lucky bastard. You fall into that category if you can manage to satisfy these two conditions:

have an application activity cycle that has a very low activity period. For example: you serve customers in just one country and you have near zero activity at 3 am.
have an application that, measured between two adjacent low activity periods and without a Full GC, consumes less memory than your server's real memory. For example: between today at 3 am and tomorrow at 3 am, your app requires 4 GB total memory and your server has 8 GB physical memory.

Update: to clarify rule 2, let's say you observe that the JVM executes a Full GC every hour and that each run releases 1Gb of OldGen memory. Then, on a 24 hour cycle, that JVM will require 24Gb of memory, plus the minimum required memory, plus some extra memory for safety to avoid executing a Full GC. If your server has 48Gb of memory, then allocating, say, 32Gb to the JVM should do the trick.
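The sizing rule in that update can be sketched as a small Python calculation. The function name is made up, and the 4Gb minimum and 4Gb safety margin are hypothetical values picked only so that the total matches the 32Gb of the example; measure your own application to get real numbers.

```python
def heap_needed_gb(released_per_full_gc_gb, full_gcs_per_cycle,
                   minimum_gb, safety_gb):
    """Memory needed to go one whole cycle without running a Full GC."""
    garbage_gb = released_per_full_gc_gb * full_gcs_per_cycle
    return garbage_gb + minimum_gb + safety_gb

# From the example: one Full GC per hour, each releasing 1Gb, on a 24 hour cycle.
# Hypothetical assumptions: 4Gb minimum to run the app, 4Gb safety margin.
needed_gb = heap_needed_gb(1, 24, minimum_gb=4, safety_gb=4)
server_gb = 48
fits = needed_gb <= server_gb
print(f"JVM needs about {needed_gb}Gb; fits in a {server_gb}Gb server: {fits}")
```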

In other words, lucky means that you can allocate 6 GB of memory to your JVM (which avoids a Full GC during working hours) and you also restart your app at 3 am, before the Old Gen gets full.

Yes! You get away without a Full GC because you kill your JVM before it actually needs it. This approach is being used by many financial institutions (especially high frequency traders) to avoid the latency of a Stop the World GC.

Now, if you are an average mortal, your JVM will actually stop to GC at some point or another. So what's the size to use for the Old Gen? It depends on your processing requirements and requires trial and error. What you must embrace is the fact that the Full GC cannot be avoided. What you can do, however, is to control for how long the Full GC stops the world. As said, that depends on the total number of objects, which in turn depends on the total memory that the GC must scan and copy.

So the value that you must set can actually be tuned to your needs, but the unexpected thing is that the memory size might have to go down for you to hit your target SLA requirements (ex: all requests must be answered in less than a second).

I know this is counter intuitive, and frightening. A few days ago, i was talking to a customer and said that the total configured memory a JVM had should probably go down from 4.5Gb to 3Gb to reduce the Full GC stop time from 2.5 seconds to less than 1s. A few eyebrows were raised and some fearful looks crossed the room.

Nobody is happy with the idea of reducing the memory allocated to a process, especially in these days of Moore's law, when cheap hardware translates into a default of more is better.

But the GC time grows roughly linearly with memory size, and perhaps it's better to have frequent shorter stops than less frequent but longer stops. Your SLA might call for this.

You see, sometimes less is better.

So, what is your target Memory value?

Well, it's trial and error, but now you know how to tune the value. You check how frequent the Full GC is and how long it takes, and figure out if you are a lucky bastard. If you are, then you are all set.

If you are not, then the duration of the Full GC should tell you if you have to add or remove Old Gen memory and then tune the size slowly until you get the expected results.

There is also some good news to those lucky bastard wannabes. If you are elastic on the number of CPU/Cores/memory available, then you can get away with no Full GCs, even if your app has significant activity all day long. All you have to do is monitor when your JVMs are approaching the Full GC mark and restart them. An Old Gen at 95% utilization can mark a good time when you recycle the JVM, but this also needs tuning.
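A minimal Python sketch of that monitoring rule follows. The function names are hypothetical, the 95% threshold is just the starting point mentioned above, and the utilization readings are made-up numbers of the kind you would take from the O column of jstat -gcutil.

```python
RECYCLE_THRESHOLD_PCT = 95.0  # starting point from the text; needs tuning per app

def should_recycle(old_gen_pct, threshold=RECYCLE_THRESHOLD_PCT):
    """True when Old Gen utilization says it is time to restart the JVM."""
    return old_gen_pct >= threshold

def seconds_until_threshold(pct_before, pct_now, interval_s,
                            threshold=RECYCLE_THRESHOLD_PCT):
    """Linear guess of the time left, from two Old Gen utilization samples."""
    growth = pct_now - pct_before
    if growth <= 0:
        return None  # Old Gen is not growing, nothing to project
    return max(0.0, (threshold - pct_now) / growth * interval_s)

# Made-up readings: Old Gen went from 80% to 83% in 60 seconds.
time_left = seconds_until_threshold(80.0, 83.0, 60)
print(should_recycle(83.0), time_left)
```

The projection gives you time to start a replacement JVM and drain traffic from the old one before the threshold is hit.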

Whether you are tuning the NewMem size or the OldGen size, the two most important things to remember are:

that the value you set has direct impact on the duration of the GC cycle, and
that the type of garbage created depends on your applications, so the values for one app will not necessarily work for a different app.

What about Application Logic?

As i mentioned before, two factors affect the Full GC time. The first one is the memory size which we just covered.

The second factor is Application Logic. More precisely, the reference structure and the pattern of change of Long Lived objects.

The more intricate the links between objects, the more expensive it becomes to move an object around in memory, due to the pointer fixing. So you probably don't want a big mesh of objects referring to each other. A simple reference structure is probably the best choice here, unless you absolutely need something more complex.

The other issue to control is the pattern of change. It will generally be best if your long term objects (ex: cached options and configuration) are mostly stable over long periods of time. The reason being that a destroyed long term object causes a hole in memory that must be reused, which in turn requires moving around other long term objects.

Are we stuck with Full GC?

There is also a bit of other good news, if you can afford to spend some green currency bills.

There is a company called Azul Systems that created a special brand of JVM called Zing. One of the biggest features of Zing is that it has a GC that makes no Full Stop the World GCs.

I haven't used it and I have no relation with Azul Systems, so i can't really say much more about it, but it sure sounds cool.

Some extra info

If you want to know more about Garbage Collectors, you can watch this InfoQ presentation by Gil Tene (CTO and co-founder of Azul Systems). It's about Garbage Collectors in general. Again, I have no relation to Azul Systems in any way, but the information on this presentation is very good.

This is enough to be my thirteenth post.

Harder to kill than cockroaches

2012-08-27T09:00:00.000-03:00

About 300 million years ago, cockroaches appeared on Earth as a life form. I'm sure the first ones were very primitive (as it is with every life form), but at some point one specific kind of cockroach evolved, and it took over the Earth, because it was the most advanced cockroach that ever existed! This one was extremely resilient to almost everything (even radiation) and will probably outlive us all. Some variations of this insect evolved later (there are 4500 sub-species, 30 of which are related to human habitats, with specific variations by continent).

In 1990, the first web browser came to exist. It was called "World Wide Web". I'm sure it was primitive (compared with what we have today) and it spawned a whole lot of new experiments. Soon we had subspecies like NCSA Mosaic (and i'm old enough to have used it), then Netscape (which i've also used), Internet Explorer, Firefox, Safari, Chrome, Opera and many many others. Not sure there are thousands of browser subspecies, but i would say there are at least a hundred, easily.

You probably see where this is going, but in August 2001, the most advanced species of browser appeared on Earth, and it took over the planet. It was Internet Explorer 6. There existed many subspecies (different service pack levels, different languages, different configurations and enabled options). Some of these subspecies were geography specific. For example, I understand that South Korea has a government mandated authentication protocol that was supported only by IE 6. There is also the Chinese version of IE 6 which is used in... China.

IE 6 inherited an adaptation from its previous primitive sibling: AJAX. It carried it hidden in its DNA until it became a real evolutionary advantage, circa 2005, when evolution in behavior began, in the form of innovation. And it was a boom. It learned so many new tricks in a few years that it changed the environment forever, but it was not very well suited to this new environment. Evolution took over again, with Firefox, IE7, Chrome, etc.

When it first appeared, it was the most advanced on Earth... exactly 11 years ago, on August 27, 2001.

But 11 years in the computer industry is like some number of millions of years in evolution, and the damn thing is still lingering on, like a cockroach that refuses to die. When support was finally discontinued, Microsoft itself declared that IE 6 is a fossil, but it's a living fossil.

Not something of the past. A living fossil! Something you would expect in a Hollywood movie like Journey to the Center of the Earth. Or maybe in a Steven Spielberg movie where some scientist recovers a portion of bits from an old damaged CD-ROM and recreates it using some bits from another application.

In an effort to eradicate the thing, Microsoft created the IE6 Countdown site, to track where on the planet this species is still a pest, carrying deadly diseases to developers, businesses and users alike.

Still 6% presence, mostly in China, where it has 21% market share! I have been tracking the evolution of the fight and I'm happy to say that from June 2012 to July 2012, Argentina (where i live) crossed the barrier from 1.2% market share to less than 1% (0.9%, to be precise). Argentina is now marked green on the map!

But it is a die-hard. Even Norway (which crossed the 1% barrier long ago) still can't get rid of it, with 0.1% market share.

So, what can you do? When you see a cockroach in your kitchen don't you kill it? Do the same thing with IE6. When you see someone using it, kill it. Go to the web, search for another (any) browser, download it, install it and feel good you did a good thing for that someone.

IE 6 did good. Changed the environment to something more hospitable. Enabled an explosion. But it is now past its extinction time.

This is enough to be my twelfth post.

DB Connection pool losing connections (AKA: the not so transparent firewall)

2012-03-30T19:22:00.002-03:00

I'm posting this because i've seen this problem many times, especially in corporate environments.

The symptom

You have an application (usually a java app running on an app container such as tomcat) that connects to a (any) database. You change from using a simple "connect to the db when you need it" to "use a connection pool to eliminate connection setup time".

You immediately roll back the connection pool change because the connections seem to die after some time (usually low usage or night time), making your application unresponsive.

The explanation

Well, there are many possible causes for the symptom, but one of the most common causes i've seen (especially in corporate environments) is the widespread use of transparent firewalls.

As a side note: be wary of anything that calls itself a transparent something, such as transparent firewalls or transparent proxies because they are, at best, translucent.

To understand the problem, we must first understand that the TCP stack closes a connection after exchanging a number of packets with its peer TCP layer. The message flow is complex because a TCP connection can be half closed, meaning one of the peers is done with it but the other is still sending. In order to close a connection, one of the peers sends a message with the FIN flag set (which triggers the FIN-WAIT statuses you can see with the netstat command). On the closing side, the FIN-WAIT-1 status then progresses to FIN-WAIT-2, then to TIME-WAIT and finally to CLOSED (CLOSING shows up only when both peers close at the same time).

Another alternative to connection closing is for one peer to send a TCP RST (reset) packet, indicating that the whole connection is in failure mode and shall be terminated.

TCP keep-alive was invented so that long lived connections could be kept alive. This includes exchanging ACK (acknowledge) packets between peers every now and then (usually configurable at the kernel level).

When keeping a connection alive, if a host doesn't hear from its peer, then a lengthy series of attempts to reach the peer begins (we'll come back to this later). Upon failure, this series ends with the kernel giving up on the connection, which has been occupying a connection table slot for quite some time.

Enter the corporate firewall.

The most common usage of a corporate firewall is to actually do connection filtering. This means: telling what host can connect to what other host+port. I'm not really sure why a firewall is used for this, because most basic routers (and even switches) can do this kind of thing, albeit more in an NAT kind of way, not in an ip-hiding way.

Connection filtering works something like this:

When a connection is attempted, the firewall looks at the filtering rules to check if the 4-tuple (source IP, source port, destination IP, destination port) is in the list of allowed connections.
If it's not on the list, then one of the following happens:
- an RST packet is sent to the source host, blocking the establishment of the connection. Not the best option, as it informs the source host that the remote ip does exist but doesn't like the port.
- nothing is sent, letting the source host retry until it times out. This is good, as the source host does not know the remote ip is valid, preventing address/port network scans.
If the 4-tuple is on the list, then both of the following happen:
- An entry representing the 4-tuple is created in an in-memory table inside the firewall software.
- The connection setup packet is forwarded to the destination host (oftentimes with some IP address changing).

So far so good.

Now, for a reason that escapes me, many (most?) corporate firewalls don't like low traffic long lived TCP connections. They don't like these connections badly enough to go to the extreme of detecting TCP keep alives in order to be able to euthanize these connections.

Low traffic long lived connections in corporate environments are darwinianly selected out by corporate firewalls. This natural selection is what's causing trouble in your environment.

The real problem is that the connections are not being terminated, just being euthanized. Yes, there is a big difference.

In a connection termination scenario, the firewall has to do two things:

Send each real peer a TCP RST packet. This gives the peers a notification of the massacre. This notification enables the host's kernel to return an error to the application which will notice the connection is closed and be able to re-connect.
Drop the 4-tuple from the firewall's valid connections table (thus blocking all following traffic).

But the connection is euthanized. This means the firewall usually does just step number 2.
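The difference can be sketched with a toy model in Java. Everything here is an assumption made for illustration: the class SilentFirewall, its idle timeout and the string used as the 4-tuple key are hypothetical and do not describe any real firewall product.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a firewall connection table (not a real product).
final class SilentFirewall {
    private final long idleTimeoutMillis;
    // 4-tuple (as an opaque string) -> time of the last forwarded packet
    private final Map<String, Long> lastSeen = new HashMap<>();

    SilentFirewall(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // An allowed connection gets an entry in the in-memory table.
    void open(String fourTuple, long now) {
        lastSeen.put(fourTuple, now);
    }

    // Euthanasia: idle entries simply vanish. No RST is sent to either peer.
    void evictIdle(long now) {
        lastSeen.values().removeIf(seen -> now - seen >= idleTimeoutMillis);
    }

    // Returns true if the packet is forwarded, false if it is silently dropped.
    boolean forward(String fourTuple, long now) {
        evictIdle(now);
        if (!lastSeen.containsKey(fourTuple)) {
            return false; // the peers are not told; they just keep retrying
        }
        lastSeen.put(fourTuple, now);
        return true;
    }
}
```

In this sketch a connection that sends a packet before the idle timeout expires keeps its slot, while one that stays quiet too long finds its packets dropped without any notification.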

The peers, not knowing about the transparent firewall, enter the lengthy process of trying to reach each other. A process that is doomed to fail because the firewall is not forwarding the traffic.

Eventually, the (kernel) peers decide to give up and close the connection, letting the application know with an error. The key word in the previous sentence being eventually. In the meantime, the connections are zombies inside the kernel and the application is not notified until it's too late.

The Solution

The good news is that you can probably fix this problem without having to convince the people running the firewalls that they are causing trouble and should change their defaults (after all, this is a silo'd corporate environment, what are your chances of effecting change on a different department?).

If you are using a standard connection pool, there are two options you have to set to avoid triggering the euthanasia:

Set the timeout (or keep alive) connection time to a lower value than the firewall's idle value. If the guys running the firewall don't tell you what this value is, you can always hand test it.
Set the option to do an active keep alive. This will effectively send a command to the database causing real traffic to go thru the connection. This involves a command without side effects (such as select 1 from dual, select current_time or something like that).

If you are not using a connection pool but an ad-hoc application-level protocol, you can always create a nop (no operation) command that responds with some data.
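As a rough sketch of the active keep alive idea using plain JDBC: the class name, the choice of pinging at half the firewall's idle time and the validation query are assumptions. Real connection pools expose their own options for this, so check your pool's documentation before rolling your own.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of any connection pool library.
final class ActiveKeepAlive {
    // Assumption: pinging at half the firewall idle time leaves a safety margin.
    static long pingIntervalSeconds(long firewallIdleSeconds) {
        return Math.max(1L, firewallIdleSeconds / 2);
    }

    // Sends a command without side effects so real traffic crosses the firewall.
    // Returns false when the connection is broken and should be replaced.
    static boolean ping(Connection connection, String validationQuery) {
        try (Statement statement = connection.createStatement()) {
            statement.execute(validationQuery);
            return true;
        } catch (SQLException e) {
            return false;
        }
    }
}
```

A scheduler would call ping on each idle connection every pingIntervalSeconds, with a query such as select 1 from dual, and discard the connections for which it returns false.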

This is enough to be my eleventh post.

Will extreme poverty ever end? Nuru International

2012-03-08T00:27:00.000-03:00

Hi,

after watching a Google Tech Talk by Jake Harriman, founder and CEO of Nuru International (nuruinternational.org), titled Will Extreme Poverty Ever End?, three things got into my mind.

The very first one was to actually go and make a monthly donation. Go ahead and do so yourself. Not only will you never die of a severe case of donation, it also increases your overall happiness, which reduces stress. So go and do something good for others and yourself (or yourself and others).

The second thing is that i couldn't stop wondering about this guy. A US Marine that went to Afghanistan (something that some people, including me, consider to be one of those bad things the US does abroad). This guy decided to keep fighting terrorism on very different front lines: those of NGOs and humanitarian relief.

I can't help thinking that when a soldier decides that she or he will be more effective fighting terrorism by eradicating poverty, then a lot of good things are going on in the world. Many many more good things than we would assume if we just listen to mainstream news organizations, or old folks saying things used to be better.

While i never had the paralysis problem when facing a decision of letting go of things and start anew, it is very telling that changes like the one Jake Harriman made are not only possible, but feasible.

Bold move.

The third thing that got into my mind is innovation. You can tell that this guy took a lot of baggage from the Marines (well, you can't stop being who you are) and applied it to the field of humanitarian relief. Nuru works like a contractor. They actually don't try to reinvent the wheel, but to figure out the most effective wheel for each type of task. Nuru partners with other NGOs so that both NGOs can better fulfill their respective missions. And Nuru is trying to actually measure things to figure out what works and what doesn't, to try new things and fail fast.

A very agile style.

I couldn't help comparing this approach with the Record Labels' and Entertainment Industry's. We were recently bombarded with the whole SOPA/PIPA debate (even here in Argentina it was all over the news). These industries don't try new things. They don't even like others to try new things. In the case of Hollywood, they even disregard their own history. Independent studios relocated to the US west coast because the many patents and laws on the east coast were stifling innovation and new business models. And these independent studios grew to be Hollywood.

So, instead of collaborating with Napster, Labels shut it down. Neither got the chance to better fulfill their respective missions.

Nothing like contrast.

This is enough to be my tenth post.

Donate to Wikimedia Foundation for 2012 Fund Raising

2011-12-29T14:54:00.000-03:00

Once again, a year is about to end.

As i did last year, i've already donated money to the Wikimedia Foundation.

Last year, Wikimedia Fundraising lasted 50 days and got to cover the expected 15 million USD.

This year, Wikimedia is expecting to raise 28.3 million USD. Donations will be accepted through January 2012.

Do you use Wikipedia? Then donate now!

Human knowledge should be a public good, like the environment.

Wikipedia does just that: make knowledge available to everyone that wants it. Please help!

This is enough to be my ninth post.

Dennis Ritchie passed away

2011-10-15T19:48:00.001-03:00

Dennis Ritchie was found dead in his apartment last Wednesday (Oct 12). He apparently died last weekend.

Dennis Ritchie created the C programming language and co-created Unix with Ken Thompson. He also worked with Brian Kernighan.

This is a very sad October.

Steve Jobs passed away

2011-10-05T21:50:00.000-03:00

After a little over nine months of not writing on this blog, i'm very sad the reason for this post is the passing of Steve Jobs.

Much can be said about the good and bad (of Apple, the philosophy, the technology, his character, etc).

None of that matters to me today. I feel really sad.

Having lost family to cancer, i only hope he didn't suffer much.

Donate to Wikimedia Foundation!

2010-12-31T17:15:00.000-03:00

The year is about to end (here in Argentina it's still 2010) and the Wikimedia Foundation is doing another round of fund raising to sustain next year's operations. They need only 16 million US dollars (they have 15.2 M already).

Do you use Wikipedia? The english version? The spanish version?

I have donated already. In the process, i discovered that there is an Argentine chapter of the Wikimedia Foundation (not every country has one).

You can donate here (english) or here (spanish). You can also donate to the local chapter (see the bottom of the page).

You can check how much money they have already raised by just going to Wikipedia's main page for any language. There is a detailed fund raising page, yet it seems to have different information (outdated?).

I hope you all have a happy 2011 and, if you didn't donate in 2010, consider making a donation as one of the first actions of the new year.

This is enough to be my sixth post.

About (abuse of) Power

2010-12-20T19:42:00.000-03:00

This is an unusual post as it started with something unrelated to IT but i think recent events made it relevant to our industry because of the Java TCK licensing kerfuffle (AKA: Oracle versus ASF).

A few months ago, i was having lunch with some friends and we were talking about power in politics. Abuse of power in politics, to be specific. After all the talking, i realized that we tend to believe power has properties that it actually doesn't have.

We tend to think power is a solid. Something like bricks.

The idea being that we can get as much power as we can pile bricks.

Such was the mental image i had when the conversation started, but the image changed rapidly.

I figured that comparing power with a solid was not entirely right. I was missing something.

I think power is more of a fine grained solid. To continue the construction metaphor: power is like sand.

You can have some small amount of power and hold on to it, the same way you can do with sand.

You can even succeed and get a lot of power for a while.

Yet, as it happens with sand, power filters out, eroding as it filters and making the leak bigger and bigger.

And as you see the leak you begin to get desperate because you are losing what you hold dear. And you start doing stupid things. Just as anybody does when they have lots of power.

This has happened to genocidal maniacs, dictators, presidents, political parties, police, mobsters, gangsters, companies, CEOs, bosses, abusive husbands, abusive wives, child molesters, bad teachers, bullies, big brothers, little brothers, etc.

We all have abused power in some situation or another.
We all know abuse comes back to balance the score.

I think we have recently seen an example of abuse of power in the way Oracle used Java's ownership. My guess: as a way to get Google to pay for using Java on Android (J2ME enabled phones probably pay royalties of some sort). I don't have anything against Oracle making money, yet i personally disagree with this move because it gets the open source programming community and the ASF in the middle of a corporation war.

Oracle may succeed in the short term in getting some more money, but the forces propelling open communities will route around this issue and when that happens, this will undoubtedly come back to bite Oracle on the hand.

Abuse of power has happened before and will happen again.
Loss of that power has happened before and will happen again.
No matter who. No matter why.

Disclosure: i'm not active in any open source community.

This is enough to be my fifth post.

PS: photographs are from stock.xchng and iStockPhoto

(not) Messing up your Data Model because of your ORM

2010-10-12T01:08:00.000-03:00

A few months ago, one of my clients asked me to take a look at the Data Model of one of his applications because of performance problems. A review showed database tables had no indexes, that being the cause of the performance hit.

Yet a deeper and more complex structural problem surfaced upon further inspection. This second problem was, and still is, the result of lack of knowledge of both Databases and ORMs, so i thought it was worth sharing.

Refactoring an application usually isn't a big deal by itself. But when you have a Database involved, refactoring turns into a very complex problem beyond the realm of a single programmer, because data needs to be restructured. Moreover, when data is comprised of millions of records and over a terabyte of storage, the restructuring process can take months making it impractical and even forbidden by laws and regulations, as it is for this particular case.

The Application and the Implementation

The troubled application stores transactions in a database. It consists of a web service writing to the database and a web interface to query inserted records. Transactions are composed of primary meta-data and three optional groups of data: Extra meta-data, Business data and LowLevel data. Each of the four data groups is stored in its own table. Tables hold historic information because it must be available for querying at any time and because regulations require it.

When development started, an Object Model was created for the write web service that could be described as:

Four classes to model a transaction:
class Transaction: holds primary meta-data

class Extra: holds extra meta-data

class Business: holds business data

class LowLevel: holds low level data

These classes are POJO objects; and there is also a DAO class to abstract database access. The classes Extra, Business and LowLevel are attributes of the class Transaction.

Here is when things started to run amok and reality and implementation started to diverge.

Note: remember Euclidean Geometry? What happens when even a very small divergence between two straight lines is propagated to a very large distance?

Let's see first what the Good Data Model should look like:

The Good Data Model

As the image shows, there should be a table named Transaction, with a monotonically increasing primary key (an auto-incremented id, mapped for example to an Oracle Sequence or an MS SQL Server IDENTITY). There should also be three satellite tables (Extra, Business and LowLevel), each with just a primary key and, if you feel like it, a foreign key to the Transaction primary key. Notice that this design preserves the fact that satellite tables are dependent upon the primary table. This design is guided by what is called Normal Form in Database courses at colleges.

However, the implemented Data Model, we will call it Bad, looks like the following:

The Bad Data Model

As the image shows, now each table has an IDENTITY primary key. There are also three attributes on the Transaction table, each having the value of the primary key for the record in the corresponding satellite table.

I believe two separate things conspired to make the programmer create this Bad Data Model:

The Object Model makes the four classes (types) independent of each other. An instance of each class can exist on its own as you can call operator new on that class. Even when this makes syntactic and semantic sense for the Object Model, it doesn't reflect reality and it doesn't make syntactic nor semantic sense for the Data Model.
Not knowing the full set of capabilities of Hibernate ORM. When asked, the programmer's answer was that Hibernate cannot map things like the Good Data Model, even when it made sense from the Database point of view. Of course, Hibernate can properly map the right Data Model (see One-to-One bidirectional association in the Hibernate docs, especially the relation between the Person and PersonAddress tables).

The implemented Bad Data Model created two very serious structural problems. Let's try and understand the how and why of these problems.

Temporal Causality no more

The implemented Bad Data Model has one very important side-effect related with the record ordering.

Given the structure of the Bad model, and for performance reasons, the insert sequence looks like the following (this is just mockup code):

// create a transaction
hibernateTrans.create();

// insert into the ***SATELLITE*** tables
hibernateTrans.save(extra);
hibernateTrans.save(business);
hibernateTrans.save(lowLevel);

// set the satellite tables id in the main record
trans.setIdExtra(extra.getId());
trans.setIdBusiness(business.getId());
trans.setIdLowLevel(lowLevel.getId());

// save the main record
hibernateTrans.save(trans);

// commit the transaction
hibernateTrans.commit();

As you can see, the insert order is reversed compared to what common sense dictates: Transaction then satellites versus satellites then Transaction. This is because the Transaction table needs to store the references (IDs) to satellite tables and those are only available after the records are inserted. If the order is not reversed, then a later update must be executed on the already inserted Transaction record.

As long as you execute in a single threaded environment, such as development, you can guarantee that for two consecutive transactions T1 and T2 (and the satellite tables) the invariant: "T1.id < T2.id implies T1..Business.id < T2..Business.id" holds true; the same can be said for the other satellite tables. The invariant provides temporal causality across tables of the Data Model. It does so because as IDs are monotonically increasing you can safely infer time causality on a table: lower id means the insert happened before.

However, in a production environment, with the inserts running on multiple threads, that is no longer the case. To make it clear, imagine that transaction T1 is taken by thread T1 and that transaction T2 is taken by thread T2. The following image shows a possible sequence of execution:

The green arrow marks the execution flow between the two threads (the horizontal lines are thread switches). You must remember that the selected ID for each inserted record is decided by the Database, based on the last used ID for each specific table.

At the base of the image you can see the list of IDs assigned to each table: Transaction, Extra, Business, and LowLevel. As you can see, each list is a mixed set of values because the records are inserted in different order. Thus, the invariant does not hold.

This seemingly unimportant fact has very deep implications for a Database server, because it is common practice that records are physically stored on disk in Primary Key order. Thus, when you JOIN rows from the Transaction table with one of the satellites, records on both tables will not be in the same order, forcing the Database to perform the extra step of ordering. This ordering step can be executed: 1) in memory (consuming a lot of resources); or 2) on the fly by making disk seeks for each matching record (not cache friendly).

Now, contrast what happens with the Good Data Model. The sequence is to insert first the Transaction record and get the ID (primary key value). Then, the same ID is used to insert into the satellite tables. This holds the temporal causality invariant and forces the same order on all tables, thus enabling the database server to read records stored in the same physical order, in forward-only mode. This is very disk cache friendly and supports data read-ahead and data-prefetch algorithms.
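The contrast between the two insert sequences can be sketched as a small deterministic simulation. This is only an illustration under stated assumptions: the class and method names are made up, the counters stand in for the database's IDENTITY generators, only the Business satellite is shown, and the thread switches are hard-coded instead of using real threads.

```java
// Hypothetical simulation; each row is {transactionId, businessId} for one writer.
final class InsertOrderSimulation {

    // Bad Model: satellites first, each table with its own IDENTITY counter.
    static long[][] badModel() {
        long nextTransactionId = 0, nextBusinessId = 0;
        long[] writer1 = new long[2], writer2 = new long[2];
        writer1[1] = ++nextBusinessId;    // writer 1 saves its Business record
        writer2[1] = ++nextBusinessId;    // thread switch: writer 2 saves Business
        writer2[0] = ++nextTransactionId; // writer 2 saves its Transaction record
        writer1[0] = ++nextTransactionId; // thread switch: writer 1 saves Transaction
        return new long[][] { writer1, writer2 };
    }

    // Good Model: Transaction first, the satellite reuses the same id.
    static long[][] goodModel() {
        long nextTransactionId = 0;
        long[] writer1 = new long[2], writer2 = new long[2];
        writer1[0] = ++nextTransactionId; // writer 1 saves its Transaction record
        writer2[0] = ++nextTransactionId; // thread switch: writer 2 saves Transaction
        writer2[1] = writer2[0];          // writer 2 saves Business with the same id
        writer1[1] = writer1[0];          // thread switch: writer 1 saves Business
        return new long[][] { writer1, writer2 };
    }

    // The invariant: ordering by Transaction id matches ordering by Business id.
    static boolean invariantHolds(long[][] rows) {
        return (rows[0][0] < rows[1][0]) == (rows[0][1] < rows[1][1]);
    }

    public static void main(String[] args) {
        System.out.println("Bad Model keeps the invariant:  " + invariantHolds(badModel()));
        System.out.println("Good Model keeps the invariant: " + invariantHolds(goodModel()));
    }
}
```

With these particular thread switches the Bad Model ends up with the two tables ordered in opposite ways, while the Good Model keeps the same order no matter how the writers interleave, because the satellite key is a copy of the Transaction key.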

Note: for those of you that have the "concurrent and parallel programming" gene, perhaps i should make explicit the fact that losing temporal causality is due to the lack of a "critical section" primitive when accessing the database. While the "get the next id" operation is atomic, you must realize that the Begin/Commit/Rollback transaction construct only guarantees that the sequence is executed completely or not at all, it says nothing about executing undisturbed. In fact, for many versions of MSSQL Server a Rollback doesn't undo the "get next id" associated with an Identity field, making it forever consumed. I think a rollback doesn't undo an Oracle Sequence either.

Query optimization is severely affected

The second structural problem is perhaps more important, because it deeply affects the query optimizer's ability to change query evaluation order. The best way to understand why is to think in terms of solving the following query:

SELECT E.field1, B.field2
  FROM Transaction T
  INNER JOIN Extra E
    ON T.idExtra = E.id AND /* conditions on Extra */
  INNER JOIN Business B
    ON T.idBusiness = B.id AND /* conditions on Business */

Note: the JOIN conditions for the Good Model will be different. For example, the Extra table JOIN will have T.id = E.id as condition instead of T.idExtra = E.id.

Despite many differences between specific database products, we can think a generic database server will solve this by computing one partial temporary result (subquery) for each table (Transaction, Extra and Business). The server will then proceed to calculate the intersection between all the matching temporary results. In Relational Algebra this matching is called a Natural Join and can be thought of as a set intersection (∩) or a logical and operator. This last step can be represented then by the expression: Transaction and Extra and Business.

A lot of the biggest optimizations happen when the server calculates this intersection. To understand how this happens and why, it is handy to think there are a set of functions that relate different tables of the Data Model. These functions have a single parameter (the Primary Key in the source table) and return the Primary Key of the associated record in another table. The next image represents these functions:

Note that there is a pair of functions between tables. For example, between the Transaction and Extra tables there are E(t) ⇒ e and the inverse Ei(e) ⇒ t. There are also relations between the satellite tables of the model. For clarity, the image shows only the relation functions between the Extra and Business tables. Let's examine first the functions E and Ei.

For the Bad Model, E(t) is very simple. Just return the field T.idExtra (see the JOIN condition of the example query). The inverse Ei(e) is more complex. You start with the primary key of Extra record (e) and then you must find the primary key of the associated Transaction record. To do that you must either: 1) read every Transaction record to find the one with idExtra = e, or 2) create an index on the idExtra field. The index is obviously faster, but it still requires some disk accesses.

For the Good Model, the function E(t) is also simple. Just return the same t (remember JOIN conditions are different). The inverse Ei(e) is equally simple. Just return the same e. No extra indexes, no extra disk accesses. Just the identity function, as expressed by the JOIN conditions.

Now just for fun, let's try to go from the Extra table to the Business table. The database will have to synthesize the functions EtoB(e) ⇒ b and BtoE(b) ⇒ e. EtoB(e) can be written as a function composition: EtoB(e) = B(Ei(e)). That is: given e, find the t that matches and then with that t, find b. Similar steps must be taken to create BtoE(b).

For the Bad Model, Ei requires an index search while B just returns the idBusiness field of the specific Transaction record, but that means the database must read the record pointed to by e. For the Good Model, we know that both Ei and B are the identity function, so B(Ei(e)) = Identity(Identity(e)) = Identity(e) = e. No reads, no index searches. Again, just the identity function.
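The composition can be written down as a minimal sketch. The names are hypothetical, and the in-memory Map only stands in for a database index on the idExtra field; it is not how any particular database implements it.

```java
import java.util.Map;

// Hypothetical illustration of the relation functions EtoB(e) = B(Ei(e)).
final class RelationFunctions {
    // Bad Model row: the Transaction record carries the satellite ids.
    static final class BadTransaction {
        final long id, idExtra, idBusiness;

        BadTransaction(long id, long idExtra, long idBusiness) {
            this.id = id;
            this.idExtra = idExtra;
            this.idBusiness = idBusiness;
        }
    }

    // Bad Model: Ei needs an index lookup on idExtra, and B needs the
    // Transaction record that the lookup found.
    static long extraToBusinessBad(long e, Map<Long, BadTransaction> indexOnIdExtra) {
        BadTransaction t = indexOnIdExtra.get(e); // Ei(e): index search plus record read
        return t.idBusiness;                      // B(t): read a field of that record
    }

    // Good Model: Ei and B are both the identity, and so is their composition.
    static long extraToBusinessGood(long e) {
        return e;
    }
}
```

The Good Model version needs no data at all to answer, which is the property the optimizer can exploit.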

Let's go back to our example and put all this into practice. Imagine the subqueries get executed and we get 500 results from table Extra and 1100 results from table Business, having a total of 1 million records in table Transactions (no conditions on Transactions on our example query). To compute Transaction and Extra and Business, our generic database will do one of the following:

Bad Model without indexes: must go through each of the 1 million records of Transactions to be able to match the Extra and Business records, because the relations E(t) and B(t) go in one direction and the Ei(e) and Bi(b) require scanning the whole Transaction table (called a "tablescan"). To find a matching record by field idExtra or idBusiness you would have to perform a tablescan, so starting with table Transaction is the same. This means at least 1 million Transaction reads, plus the Extra and Business record sorting and matching.
Bad Model with indexes: for each of the 500 results from table Extra, will have to do an index search to find the associated Transaction record to be able to match the corresponding Business record. Furthermore, the records on Business will not be in the same order as the records on Extra, so this requires a disk seek for each record. This means: 500 index searches on Transactions, 500 Transactions reads, 500 seeks on Business.
Good Model: because of the identity function, the server knows that it can safely compute (Transaction and (Extra and Business)) without altering results. That is: consider the intersection operation to be associative. Calculating the Extra and Business part involves a forward-only record match between 500 and 1100 records (same key on both tables, therefore same order). The records on table Transactions will have to be read because it's the table in the SELECT and not checking primary key existence on Transactions will break query semantics. This all means: forward only match of 500 and 1100 records (no internal sorting) and 500 forward only reads on Transactions.
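The forward-only record match mentioned for the Good Model can be sketched as a merge of two key lists that are already in the same order. This is a simplified illustration with made-up names that works on in-memory arrays; a real database server does the same kind of walk over records on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a forward-only match between two sorted key lists.
final class ForwardOnlyMatch {
    // Both arrays must be sorted in ascending order (same key, same order).
    static List<Long> match(long[] extraKeys, long[] businessKeys) {
        List<Long> matches = new ArrayList<>();
        int i = 0, j = 0;
        while (i < extraKeys.length && j < businessKeys.length) {
            if (extraKeys[i] < businessKeys[j]) {
                i++; // advance the side that is behind, never go back
            } else if (extraKeys[i] > businessKeys[j]) {
                j++;
            } else {
                matches.add(extraKeys[i]); // same key on both tables
                i++;
                j++;
            }
        }
        return matches;
    }
}
```

Each list is read once from start to end, so matching 500 keys against 1100 keys takes at most 1600 steps, with no sorting and no seeking back.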

The Bad Model breaks associativity because the relational function to go from one table to the next is not the identity function. The Bad Model with indexes is not as bad, but still causes a lot of disk seeks (the most expensive operation on disks). The Good Model preserves associativity and that enables the server to generate huge savings in disk seeks and reads. Database implementations have many other optimizations for identity key match, hence that kind of match should be used whenever possible.

The fix that is not

The fix must be software only and it must involve both components of the application.

First, the web service that inserts records must recreate the Object to Data mapping to use the Good data model. The relation fields idExtra, idBusiness and idLowLevel can be left blank (NULL value). Foreign Keys, if present, must be dropped from the Database.

Second, the web application that queries records should be modified to be able to use the two data models, Good and Bad. The selection of the model to use is based on a system wide property that marks the moment in time where the Good data model began to be used. Because queries can span both models, a method of joining the results of querying both models in parallel must be implemented. The querying application is also able to do some basic reporting by returning aggregate counts. This reporting has to be changed to span both data models. Think about the complexity of calculating averages crossing different timespans.
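To give an idea of the kind of care involved, here is a minimal sketch of an average that spans both models. The class names are hypothetical; the point is only that each model has to return a sum and a count, because averaging the two averages gives the wrong answer when the timespans hold different numbers of records.

```java
// Hypothetical sketch: combining partial aggregates from the two data models.
final class SpanningAverage {
    // Partial result of querying one model: the sum of the values and how many.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Combine sums and counts first, divide only once at the end.
    static double average(Partial fromBadModel, Partial fromGoodModel) {
        long count = fromBadModel.count + fromGoodModel.count;
        if (count == 0) {
            return Double.NaN; // no records on either side
        }
        return (fromBadModel.sum + fromGoodModel.sum) / count;
    }
}
```

For example, one record with value 10 on the Bad side and three records adding up to 20 on the Good side average 7.5, not the 8.33 you would get by averaging the two partial averages.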

These changes are a full blown project on their own, but the biggest problem is that future changes and feature adds to the application will duplicate the testing effort as they must be tested on both models.

A second alternative would be to create a new version of the application using the Good model and not to provide for model coexistence. With this approach, new installs will benefit, but your existing customers will be stuck with an unsupported version. This is not commercially viable, as this option implies supporting two versions of the same application forever, duplicating development and testing.

A third option was briefly considered but quickly discarded. It involved sending a T-101 to the past.

The fix is that no fix would ever be applied. Data cannot be changed, Maintenance Costs cannot be increased. The problem is assumed to exist forever.

Conclusion

There are a number of conclusions:

Data Model and Object Model are two different things for a good reason. If you decide to force one onto the other, it had better be a careful and informed decision.
Databases and ORMs are very complex tools; sometimes as complex as programming languages. It is your responsibility as a programmer to properly understand them. ORMs have been widely criticized or blindly used or both. SQL Databases have recently started to be heavily criticized too.
Local properties (Atomicity) do not imply global properties (Temporal Causality).
Once your application or product is deployed in a production environment, course corrections in the form of refactoring are not always possible. Many times logic is free to evolve while data is anchored to past versions of that logic.
In this particular case, the insert web service was programmed months before the query application, and by a different programmer. This translated into little initial engineering effort and set a bad course that is now impossible to correct.

This is long enough to be my fourth post.

PS: many thanks to Alejo Sanchez for reviewing early drafts of this post and making so many suggestions to make it better.

Exception madness

2010-09-07T10:24:00.001-03:00

Before exceptions became mainstream technology in programming languages (about 1.5 decades ago, i think), error control in programs was a delicate matter.

The problem was due to a common design pitfall that plagued (and still plagues) many technologies. That pitfall is named: in-band signaling.

Note: the name "in-band signaling" comes from the telecommunications industry (see wikipedia entry). The term comes from the fact that when you dial a number, the number itself is sent as sound over the line (you hear the tones, don't you?). In-band signaling doesn't seem to be a problem, until you realize that other functions of the telecommunication network work the same way. Thus by knowing the right code, you can just dial it on your phone and lo and behold, the next call is not billed and things like that. That used to happen a lot in the '70s, '80s and early '90s because of the cost of modem communications. I think it doesn't happen much today because once you pay for Internet access, the world is right at your fingertips.

Just think about it. If you don't have the exception infrastructure, the exceptions must be then returned as a special form of result of your method or function.

In languages like C, it is very common to see things like:

#include <errno.h>
#include <stdio.h>

FILE *file = fopen("/some_file", "w");
if (file == NULL) {
   /* oops: fopen failed and errno says what went wrong */
   perror("fopen");
}

Notice that the function fopen is supposed to return a file handle (of type FILE *), but if an error occurs, then NULL is returned and you have to check the errno variable to see what went wrong.

Do you see that the error is sent in-band with the result?
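The same pitfall exists outside of C. Here is a small sketch in Java, using only standard java.util.HashMap calls: get returns null both when the key is missing and when the key maps to null, so the "error" shares the channel with the result and a second call is needed to tell them apart. The class and method names (InBandSignal, describe) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration of in-band signaling: null means both "no entry" and "entry is null". */
class InBandSignal {
    /** Describes what the map holds for the key, disambiguating with containsKey. */
    static String describe(Map<String, String> map, String key) {
        String value = map.get(key);   // in-band: null is overloaded
        if (value != null) {
            return "value";
        }
        // The result alone cannot tell us what happened; a second channel is needed.
        return map.containsKey(key) ? "null value" : "missing";
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<String, String>();
        map.put("a", "1");
        map.put("b", null);
        System.out.println(describe(map, "a"));
        System.out.println(describe(map, "b"));
        System.out.println(describe(map, "c"));
    }
}
```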

This caused an awful lot of problems with many programs. Even to the point of using the most obnoxious pair of library calls in the C environment (namely setjmp and longjmp).

What did these two calls provide? In plain english: the ability for a program to set(jmp) a recovery point where bad situations could be handled and the ability for that program to (long)jmp to that point when bad things happened. I have to say that you ought to be a terrorist to use these two function calls, so controlling for errors in-band infected regular programming as badly as a venereal disease.

Perhaps you recognize that jumping back to a place where you know how to handle errors is just a tiny part of what the exception infrastructure provides in modern programming languages, but there is a lot more than just the jump in exception handling. That lot more is what made the pair setjmp/longjmp almost impossible to use properly. These jumps actually worked by just instantly moving the execution to a previous program scope, whereas exceptions destroy intermediate scopes (meaning: cleaning up stack objects and giving a chance to destroy manually allocated objects for languages without Garbage Collection).
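To see the "destroy intermediate scopes" part in action, here is a minimal sketch in Java (the names UnwindDemo, inner, middle and run are hypothetical). The exception thrown in the innermost method travels up to the handler, and every finally block on the way gets its chance to clean up, which is exactly what a raw jump does not give you:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: an exception unwinds intermediate scopes, running each finally block on the way. */
class UnwindDemo {
    static void inner(List<String> log) {
        try {
            log.add("inner: working");
            throw new IllegalStateException("boom");
        } finally {
            log.add("inner: cleanup");    // runs while the exception propagates
        }
    }

    static void middle(List<String> log) {
        try {
            inner(log);
            log.add("middle: never reached");
        } finally {
            log.add("middle: cleanup");   // the intermediate scope is cleaned up too
        }
    }

    /** Returns the sequence of events recorded while the exception travels up. */
    static List<String> run() {
        List<String> log = new ArrayList<String>();
        try {
            middle(log);
        } catch (IllegalStateException ex) {
            log.add("handler: " + ex.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```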

All of a sudden, with exceptions you can write much cleaner code. If, for instance, fopen threw an exception when the file cannot be opened, the following code would make a lot of sense:

try {
   FILE *file = fopen("/some_file", "w");
   // do something with the file
} catch (FileNotExistsException ex) {
   // handle error
}

However...

Bad things happen

In our case, 3 bad things to be exact.

The first bad thing is that the code to handle an exception gets separated from the code originating it. This actually increases complexity in our programs. Just consider our last example source code and imagine that the code implied by the line // do something with the file is actually 150 lines long (or even 15 lines long). Then, the line that generates the exception handled by the catch clause is not immediately obvious. In this case, the exception name will give a clue, but with more obscure exception names, the relation is not evident, just implicit, and that increases complexity. You can of course put a specific try/catch pair around each call that can generate an exception, but that makes for bigger code. And think that we have not even considered the finally construct that is expected to be used to undo object creation side-effects. A comment stating the source instruction for each catch clause will help with this issue, but it will not solve it and it is extra programmer effort.

This leads to the second bad thing. Given that properly writing exception handling code is an extra effort, many programmers started to do one of two things: 1) put a generic catch (Exception e) clause, or 2) just not handle exceptions at all. The first case just loses the chance for fine grained error handling. A single exception is completely abortive of the function or method.
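A hedged sketch of the difference, in Java. The class CatchGranularity and both methods are invented for this illustration; only Integer.parseInt and the two standard exception classes are real. The fine grained version can react differently to each failure, while the generic one flattens everything into a single answer:

```java
/** Sketch: fine grained catch clauses versus a single generic catch. */
class CatchGranularity {
    /** Fine grained handling: each failure mode gets its own reaction. */
    static String ratioSpecific(String a, String b) {
        try {
            return String.valueOf(Integer.parseInt(a) / Integer.parseInt(b));
        } catch (NumberFormatException ex) {
            return "not a number";        // e.g. ask the user to retype the value
        } catch (ArithmeticException ex) {
            return "division by zero";    // e.g. fall back to a default
        }
    }

    /** Generic handling: every failure collapses into the same answer. */
    static String ratioGeneric(String a, String b) {
        try {
            return String.valueOf(Integer.parseInt(a) / Integer.parseInt(b));
        } catch (Exception ex) {
            return "error";               // which error? nobody knows anymore
        }
    }

    public static void main(String[] args) {
        System.out.println(ratioSpecific("10", "x"));
        System.out.println(ratioSpecific("10", "0"));
        System.out.println(ratioGeneric("10", "x"));
        System.out.println(ratioGeneric("10", "0"));
    }
}
```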

The second case degenerates into the third bad thing, especially for web services or web applications.

Enter the third bad thing. Considering bad thing number one and bad thing number two, proper exception handling is complex, so when we have a number of people writing code, the only thing we can count on is that exceptions will not be properly handled. But don't despair. We can put a catch (Exception e) at the highest levels of our app and, at the worst, use an all fucked up generic error page. Even better, we can use the exception class to select a proper message to display, so it works like a charm.

Nothing particularly bad with that implementation, i even like it and use this idea, but only if it is not a replacement for proper error handling.
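For reference, a minimal sketch of such a safety net in Java. Everything here (the SafetyNet class, the messages, the handleRequest entry point) is hypothetical and stands in for whatever the top level of a real web application looks like. The exception class selects the message, with the most specific classes registered first:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a last resort handler that maps exception classes to user messages. */
class SafetyNet {
    // Insertion order matters: more specific classes must come first.
    private static final Map<Class<? extends Exception>, String> MESSAGES =
            new LinkedHashMap<Class<? extends Exception>, String>();
    static {
        MESSAGES.put(NumberFormatException.class, "Please enter a valid number.");
        MESSAGES.put(IllegalArgumentException.class, "One of the values is not acceptable.");
    }

    /** Picks the first registered message whose class matches the exception. */
    static String messageFor(Exception ex) {
        for (Map.Entry<Class<? extends Exception>, String> entry : MESSAGES.entrySet()) {
            if (entry.getKey().isInstance(ex)) {
                return entry.getValue();
            }
        }
        return "Something went wrong. Please try again later.";  // the generic error page
    }

    /** Hypothetical top level entry point: runs the action under the safety net. */
    static String handleRequest(Runnable action) {
        try {
            action.run();
            return "OK";
        } catch (Exception ex) {
            return messageFor(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(handleRequest(new Runnable() {
            public void run() {
                Integer.parseInt("not a number");
            }
        }));
    }
}
```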

The real (unintended) problem with this approach is that, when uncontrolled, it fosters...

Exception Madness

Why is that? Well, when programmers can count on having a safety net below, they tend to get lazy (we tend to get lazy). Now, you can just throw exceptions for almost any code you don't want to write.

For example, let's say you have an input field in a form that needs to be even (divisible by two). Some programmers (sadly not a few) will write a utility function or method called even that receives a generic object or integer and returns a boolean (true) if the value is even, but throws an exception if the value is odd. Of course, if you consider the method name there is no good reason for it to throw anything other than InvalidParameterType if it accepts general objects as input; but if you are going to throw your stomach contents on to your caller, at least have the decency to call the method something like ThrowsIfNotEven.

These kinds of situations are doubly perverse. First because they abuse the last resort exception handling and second because they do not put business or application logic in the right place (if you are going to abuse, then the caller should throw, not the utility function).
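What the better behaved version could look like, as a short Java sketch (the names EvenCheck, isEven and validateField are made up for the example). The utility only answers the question, and the caller, who owns the business rule, decides what an odd value means:

```java
/** Sketch: the utility answers the question, the caller applies the business rule. */
class EvenCheck {
    /** Answers the question and nothing else; an odd value is a normal result, not an error. */
    static boolean isEven(int value) {
        return value % 2 == 0;
    }

    /** The caller owns the rule "this field must be even" and reacts accordingly. */
    static String validateField(int value) {
        return isEven(value) ? "accepted" : "rejected: value must be even";
    }

    public static void main(String[] args) {
        System.out.println(validateField(4));
        System.out.println(validateField(7));
    }
}
```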

Conclusion

Exceptions are actually a step ahead in error handling, a step i am glad was introduced to many programming languages; but i think the problem is that once again, we got the silver bullet syndrome with them. Exceptions are an excellent tool, but only to the extent they are used properly, and they have some side effects too that you should be aware of.

Enough for a very very very late third post. Sorry about that.

The flavors of Java concurrency control

2009-12-24T19:43:00.003-03:00

This second post took longer than i expected. I was involved in a witch hunt in my day job for a few weeks and then everything else started piling up. The witch hunt got me into reviewing a lot of Java code running inside an ESB to find any kind of performance hog i could spot with the naked eye. Yes, i know there are profilers, but good luck trying to run one of those things in a Bank's production environment.

During the review, i found a few possible enhancements, the most important one of those is what started this post.

I will discuss two methods of concurrency control available in Java 1.5, the synchronized modifier and the classes in the java.util.concurrent.atomic package, then some conclusions. I will not use java specific objects when trying to explain some language-neutral feature (a comment with the word "LN" will be added to such lines).

The synchronized modifier

This modifier provides a (no-brainer) way to implement mutual exclusion. You prepend the modifier to a method and then that method becomes guarded by a mutex. What comes to our minds when we read a sentence like the one before is that, for the following code:

int someValue = 0;

synchronized int getNextValue() {
   return(someValue++);
}

the compiler will generate code looking like:

int someValue = 0;
Mutex getNextValue_mutex = new Mutex();

int getNextValue() {
   getNextValue_mutex.getLock(); // LN
   int tempValue = someValue++;
   getNextValue_mutex.releaseLock(); // LN
   return(tempValue);
};

It is reasonable to expect that the Mutex.getLock allows only one thread to get the lock, while queueing other threads in a FIFO way; while the Mutex.releaseLock takes the first queued thread (if any) and schedules it for execution.

And that is what the Java synchronized modifier does. Well, not quite... it turns out that this modifier does a little more than just guard the method with a mutex (see IBM's article Threading lightly, Part 1: Synchronization is not the enemy.)

The truth is that Java's synchronized modifier does flush the processor's data cache on entry and commits it to memory on exit. So the real code executed will look more like:

int someValue = 0;
Mutex getNextValue_mutex = new Mutex();

int getNextValue() {
   getNextValue_mutex.getLock(); // LN
   Native.processorDataCacheFlush(); // LN
   int tempValue = someValue++;
   Native.processorDataCacheCommit(); // LN
   getNextValue_mutex.releaseLock(); // LN
   return(tempValue);
};

Sidenote: the Native class is there representing some low-level methods to handle the processor cache (written either in C or Assembly and completely platform dependent). Also, the whole cache need not be discarded, but figuring out what can be kept is not an easy problem to solve.

Two things to notice here:

If you think about it, the synchronized modifier will work only if the processor's data cache is handled this way, and
I suspect that Microsoft's .Net synchronized modifier works the same way.

The data cache must be flushed on entry because you really don't know the status of precached data the processor might have. It is possible that on entry to the method, the processor's cached data was fetched before another processor executed the same method, rendering the cached data unusable.

So yes, you must do that nasty thing to the processor's cache. Nasty being the right word here. Just consider that Intel, AMD and every other manufacturer devotes a very large number of transistors on every chip to data caches and predictive data pre-fetch.

There is a very important reason for this expenditure in transistors. Your GB-sized main memory is slow compared to the motherboard's bus frequency, which is in turn very slow compared to the CPU's clock frequency. For example, DDR2 SDRAM has its own clock running at half the speed of the motherboard's bus clock frequency, which runs at a fraction of the speed of the CPU's clock frequency. A set of common values is: 266 MHz / 533 MHz / 2.1 GHz (memory / bus / CPU).

To keep the CPU busy, the pre-fetch logic scans ahead the instruction stream to guarantee that the data stored in memory is at hand when the processor needs it. Consider that flushing the processor's data cache throws away all the good work the pre-fetch logic has been doing, meaning that the CPU has to sit idle until the pre-fetch has a chance to start refilling the caches.

This has a terrible effect on performance. Even assuming that the cache flush affects only the entries of your program, your other threads suffer from it. Imagine what the effect will be if the flush is not selective and also discards cached data from the operating system, services, etc.

There are two more things to ponder. The first one is that this cache flushing and committing should only matter in multiprocessor systems (like your servers.) That is, up until now, in a single multi-core processor system all cores in the processor share the same cache, yet i think that abundance of transistors is or will be changing that soon. I have not been able to confirm what the JVM does in these cases (i.e: if it really optimizes the single processor case by not doing the cache dumping.) But i don't think that's very important... How many people buy a single processor server these days?

The second thing to consider is that Intel and AMD are both talking about cache coherency and invalidation protocols for inter-processor communication. The idea is that if you have one processor changing a memory position, then it will broadcast that fact to other processors, so rather than invalidating their whole caches, the other processors can just drop the outdated cache entry. I think there are already some Intel processors able to do that, but i'm not sure of this.

To recap, the synchronized modifier uses a mutex that provides for MUTual EXclusion and, in the model described above, hands the lock to waiting threads in the order they requested it. (Strictly speaking, the Java specification does not promise that ordering for synchronized; when strict FIFO matters, java.util.concurrent.locks.ReentrantLock can be created with fairness enabled.) This also requires processor's data cache flushes and commits and, depending on the platform you are running, it also requires expensive context switches to kernel mode (to operate on the mutex itself.) Threads are de-scheduled when they can't immediately get the lock so they don't consume CPU if not able to proceed.
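The ordering of waiting threads is worth pinning down in code. This is a hedged sketch, assuming strict arrival order is what you need: java.util.concurrent.locks.ReentrantLock (also part of Java 5) accepts a fairness flag in its constructor, and its documentation says that a fair lock favors the longest-waiting thread under contention. The class name FairCounter is made up; the counter is the same one used in the examples above.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Sketch: the getNextValue counter guarded by an explicitly fair lock. */
class FairCounter {
    // 'true' requests the fair ordering policy: under contention the
    // longest-waiting thread is favored when the lock is released.
    private final ReentrantLock lock = new ReentrantLock(true);
    private int someValue = 0;

    int getNextValue() {
        lock.lock();
        try {
            return someValue++;
        } finally {
            lock.unlock();   // always release, even if the body throws
        }
    }

    boolean isFair() {
        return lock.isFair();
    }

    public static void main(String[] args) {
        FairCounter counter = new FairCounter();
        System.out.println(counter.getNextValue());
        System.out.println(counter.getNextValue());
    }
}
```

Fairness is not free: the same documentation warns that fair locks usually have lower throughput than the default setting, so it is a choice to make only when the ordering really matters.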

The atomic package

The atomic package (java.util.concurrent.atomic) was added in Java 5. This package provides a set of thread-safe, atomic equivalents of scalar values (boolean, integer, long, references and arrays of these). The big thing behind these replacements is that they are lock free (no mutexes) and thread-safe.

The synchronized modifier example we've seen at the beginning of the previous section could be written as:

AtomicInteger someValue = new AtomicInteger();

int getNextValue() {
   return(someValue.getAndIncrement());
}

Notice that the synchronized word disappeared and, being thread safe, no thread will get a partially updated value.

How is this possible? Thanks to a technique called Spin Lock. A Spin Lock is a method that, AFAIK, was devised first for intra-operating system synchronization in the face of multiple CPUs. If you had something resembling an Operating Systems course, you'll remember the Test and Set synchronization method (if you don't, then you can read about it on Wikipedia.) A Spin Lock is a loop around a Test and Set operation. This makes a program keep looping (consuming CPU) until the lock is obtained. If you want to know more about Spin Locks (and many other synchronization options), check reference [1].
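A Spin Lock is small enough to write down. This is a minimal sketch in Java, not what the JVM or the atomic package literally contains: the class SpinLock and its methods are invented here, and AtomicBoolean.getAndSet plays the role of the Test and Set operation (it writes the new value and returns the old one in a single atomic step).

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of a Spin Lock: a loop around a Test and Set operation. */
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    /** Loops (consuming CPU) until the lock is obtained. */
    void lock() {
        // getAndSet(true) returns the previous value: true means somebody
        // else already holds the lock, so keep trying.
        while (locked.getAndSet(true)) {
            // busy wait
        }
    }

    /** A single Test and Set attempt, without looping. */
    boolean tryLock() {
        return !locked.getAndSet(true);
    }

    void unlock() {
        locked.set(false);
    }

    public static void main(String[] args) {
        SpinLock lock = new SpinLock();
        lock.lock();
        System.out.println(lock.tryLock());   // the lock is already taken
        lock.unlock();
        System.out.println(lock.tryLock());   // now it is free
    }
}
```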

On Intel architectures, there is a single processor instruction that exchanges two data values. One variant of this instruction (taking a register and a memory address) can be used to implement an atomic exchange or, put in other words, a Test and Set. The instruction is atomic because it automatically places a low level lock on the motherboard's bus so that no other component can access the system while the lock is held (and as memory is connected to the bus, only the lock holder can use it).

Sidenote: this instruction can be traced on Intel architectures back to the 8086 and 8088 processors, where it required a special prefix to place a lock on the bus. In case you are thinking about it, the answer is: yes! As far back as 1978, Intel processors have been multiprocessing-ready.

Now, what about the processor's data cache and the exchange instruction? Well, as you might have guessed, for this to work, the Test And Set must be cache ignorant. That is, this particular instruction causes the pre-fetch logic to ignore it, because the pre-fetched data could be altered after fetch but before use. Also, it is slow compared to other instructions, because it has to access memory right on the spot. This "slow" means that the whole system is penalized by a few bus clock cycles. The reality is of course a bit more complicated, because of dual gate memories, multiple buses, etc. But you should get the picture.
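From the Java side, the closest thing you can touch is AtomicInteger.compareAndSet, which only writes the new value if the current one is still the value you read. The following sketch rewrites getNextValue by hand as a retry loop around that call, just to expose the looping; it is an illustration under that assumption, not a claim about what every JVM does internally. CasCounter and hammer are made-up names.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: a hand-written retry loop around an atomic compare-and-set. */
class CasCounter {
    private final AtomicInteger someValue = new AtomicInteger();

    /** Behaves like getAndIncrement, with the retry loop made visible. */
    int getNextValue() {
        while (true) {
            int current = someValue.get();
            if (someValue.compareAndSet(current, current + 1)) {
                return current;
            }
            // another thread changed the value in between: try again
        }
    }

    /** Runs several threads against one counter and returns the final value. */
    static int hammer(int threads, final int callsPerThread) {
        final CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < callsPerThread; j++) {
                        counter.getNextValue();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.someValue.get();
    }

    public static void main(String[] args) {
        // No increment is lost, even though no mutex is involved.
        System.out.println(hammer(4, 10000));
    }
}
```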

There is a parent package called java.util.concurrent that provides some data structures and utilities (ConcurrentHashMap, CopyOnWriteArrayList, thread pools, etc) that use similar mechanisms to provide concurrent data reads and writes.
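As a usage sketch, here is a thread-safe hit counter built only from standard Java 5 classes, ConcurrentHashMap and AtomicInteger. The HitCounter class and its hit method are hypothetical. Notice that there is no synchronized anywhere: putIfAbsent settles the race between two threads creating the same entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: counting hits per page without any explicit lock. */
class HitCounter {
    private final ConcurrentHashMap<String, AtomicInteger> hits =
            new ConcurrentHashMap<String, AtomicInteger>();

    /** Registers one hit for the page and returns the updated count. */
    int hit(String page) {
        AtomicInteger counter = hits.get(page);
        if (counter == null) {
            AtomicInteger fresh = new AtomicInteger();
            // putIfAbsent returns the existing value if another thread won the race,
            // or null if our fresh counter was the one stored.
            counter = hits.putIfAbsent(page, fresh);
            if (counter == null) {
                counter = fresh;
            }
        }
        return counter.incrementAndGet();
    }

    public static void main(String[] args) {
        HitCounter counter = new HitCounter();
        counter.hit("/index");
        System.out.println(counter.hit("/index"));
    }
}
```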

To recap, the Spin Lock method is implemented always at the process level (does not require the operating system's help.) It does not require the processor's data cache to be flushed, but it slows down the whole system (by a few clock cycles because of cache bypassing.) Threads are not de-scheduled when they are unable to get the lock. They are allowed to keep looping (consuming CPU) until they get the lock or the assigned CPU slot is fully consumed. This makes it impossible to guarantee that threads get the lock in any particular order. It also means that the things that you can guard by use of a Spin Lock should be really fast and non-blocking (e.g. no file or network I/O).

Why are there two methods?

Well, that's easy: these are two different tools that are used for two similar yet different tasks, in different scenarios. You just have to know when to use which one.

The definitive answer is of course dependent on the full description of the problem you are solving, but the following list of questions is more or less my recipe to decide:

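Do you need strict FIFO access to the shared resource? If yes, you need a blocking lock that queues the waiting threads, like the mutex behind the synchronized method (keep in mind that Java does not formally promise FIFO order for synchronized; a ReentrantLock created with fairness enabled does favor the longest-waiting thread). Example: if eBay receives two bids for the same price, they want the first incoming one to win.
Is the code to be executed fast and small, with no I/O or other blocking operation? If yes, then you can go with the Spin Lock. Remember that the Spin Lock keeps consuming CPU while the lock is not acquired.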
Are you optimistic or pessimistic about the amount of contention on the resource? If you are pessimistic, then go with the synchronized method. If you are optimistic go with the Spin Lock variant. Here optimistic means that there will be low contention on the resource, so the busy loop of the Spin Lock will outperform the context switch (for mutex) and thread de-schedule of the synchronized case.

Besides that, i think it's safe to assume that if you need a data structure that doesn't change too often and you want maximum possible concurrency, the concurrent data structures provided by java.util.concurrent are in general a good option, but...

In a congested (CPU) system, the Spin Lock's "busy loop" could make the congestion even worse by wasting more CPU. Ideally a Spin Lock should try for some amount of time and then yield the CPU to allow for the lock holder to process and the lock to be released, but this can cause the thread to be delayed beyond what is acceptable under high CPU loads.
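The "try for a while, then yield" idea can be sketched as follows. This is an illustration only: the PoliteSpinLock class is invented, and the threshold of 100 spins is an arbitrary, untuned number. Thread.yield is just a hint to the scheduler, so how polite this really is depends on the platform.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: a Spin Lock that yields the CPU after a bounded number of failed attempts. */
class PoliteSpinLock {
    private static final int SPINS_BEFORE_YIELD = 100;  // arbitrary value, not tuned
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        int spins = 0;
        while (locked.getAndSet(true)) {
            if (++spins >= SPINS_BEFORE_YIELD) {
                Thread.yield();   // give the lock holder a chance to run
                spins = 0;
            }
        }
    }

    void unlock() {
        locked.set(false);
    }

    /** Guards a plain counter with the lock and returns the final count. */
    static int hammer(int threads, final int callsPerThread) {
        final PoliteSpinLock lock = new PoliteSpinLock();
        final int[] shared = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < callsPerThread; j++) {
                        lock.lock();
                        try {
                            shared[0]++;   // fast, non-blocking critical section
                        } finally {
                            lock.unlock();
                        }
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }
        return shared[0];
    }

    public static void main(String[] args) {
        System.out.println(hammer(4, 5000));
    }
}
```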

Conclusion

You should keep these two concurrency control options at hand in your tool-belt, but you should always start by using the most important of the tools you have at your disposal (AKA your brain) and don't get fooled.

One last example. A few months back, there was a different system suffering performance issues. One of the problems turned out to be that a one way (audit and trace) message was put on a queue to be processed in background. The decision to do this was sound, because the extra processing to be done could be made asynchronously on a different machine, making the foreground task faster, without sacrificing the audit functionality.

How was that a performance problem? Well, queues are complex data structures and as such, require the put and get operations to be synchronous (locking the queue.) This case was even worse, as the queue was part of a clustered ESB having multiple producers and consumers, on different machines.

While the system had an average load, everything went peachy, but when the load started to peak, the overhead caused by locks on the queue put/get operations began to be noticeable, up to the point where performance was hit.
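A much smaller, in-process stand-in for that design can be sketched with java.util.concurrent.LinkedBlockingQueue. The names AuditQueueDemo, run and the STOP marker are made up, and a real clustered ESB queue is of course far more complex. The point is that offer and take look innocent, yet each call takes a lock inside the queue, so producers and consumers do synchronize with each other even though the code never says so.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: foreground work hands audit messages to a background consumer. */
class AuditQueueDemo {
    private static final String STOP = "__stop__";   // hypothetical end-of-stream marker

    /** Sends every event through the queue and returns what the consumer audited. */
    static List<String> run(List<String> events) {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        final List<String> audited = new ArrayList<String>();
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    // take() blocks (and locks the queue internally) until a message arrives
                    for (String msg = queue.take(); !STOP.equals(msg); msg = queue.take()) {
                        audited.add(msg);
                    }
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        consumer.start();
        for (String event : events) {
            queue.offer(event);   // the foreground task only pays for the enqueue
        }
        queue.offer(STOP);
        try {
            consumer.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return audited;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("login", "transfer", "logout")));
    }
}
```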

Synchronization is all around you, even if you don't see it.

Enough for a second post.

UPDATE: many thanks to Alejo Sanchez for reviewing early drafts of this post.

References

[1] John M. Mellor-Crummey, Michael L. Scott, Algorithms for scalable synchronization on shared-memory multiprocessors, ACM Transactions on Computer Systems (TOCS), Volume 9, Issue 1 (Feb. 1991), Pages 21-65. You can download it for free from CiteSeer at Penn State University.

printf("Hello World!\n");

2009-11-08T13:07:00.000-03:00

Not much else to say, is there?

I mean, if you are a programmer, then you get the thing. If not, you are probably in the wrong place.

Enough for a first post.