2010-12-31

Donate to Wikimedia Foundation!

The year is about to end (here in Argentina it's still 2010) and the Wikimedia Foundation is doing another round of fundraising to sustain next year's operations. They need only 16 million US dollars (they have already raised 15.2 million).

Do you use Wikipedia? The English version? The Spanish version?

I have donated already. In the process, i discovered that there is an Argentine chapter of the Wikimedia Foundation (not every country has one).

You can donate here (English) or here (Spanish). You can also donate to the local chapter (see the bottom of the page).

You can check how much money they have raised so far just by going to the main Wikipedia page for any language. There is a detailed fundraising page, yet it seems to have different information (outdated?).

I hope you all have a happy 2011 and, if you didn't donate in 2010, consider making a donation as one of the first actions of the new year.

This is enough to be my sixth post.

2010-12-20

About (abuse of) Power

This is an unusual post, as it started with something unrelated to IT, but i think recent events made it relevant to our industry because of the Java TCK licensing kerfuffle (AKA: Oracle versus the ASF).

A few months ago, i was having lunch with some friends and we were talking about power in politics. Abuse of power in politics, to be specific. After all the talking, i realized that we tend to believe power has properties that it actually doesn't have.

Idea of Power - Forklift carrying bricks
We tend to think power is a solid. Something like bricks.

The idea being that we can pile up power just as we can pile up bricks.

Such was the mental image i had when the conversation started, but the image changed rapidly.

Reality of Power - Hand holding some sand
I figured that comparing power with a solid was not entirely right. I was missing something.

I think power is more of a fine grained solid. To continue the construction metaphor: power is like sand.

You can have some small amount of power and hold on to it, the same way you can do with sand.



Too much power - Hands pouring sand
You can even succeed and get a lot of power for a while.

Yet, as happens with sand, power slips through your fingers, eroding the opening as it leaks and making the leak bigger and bigger.

And as you see the leak, you begin to get desperate because you are losing what you hold dear. And you start doing stupid things. Just as anybody does when they have lots of power.


This has happened to genocidal maniacs, dictators, presidents, political parties, police, mobsters, gangsters, companies, CEOs, bosses, abusive husbands, abusive wives, child molesters, bad teachers, bullies, big brothers, little brothers, etc.

We all have abused power in some situation or another.
We all know abuse comes back to balance the score.

I think we have recently seen an example of abuse of power in the way Oracle used its ownership of Java. My guess: as a way to get Google to pay for using Java on Android (J2ME enabled phones probably pay royalties of some sort). I don't have anything against Oracle making money, yet i personally disagree with this move because it puts the open source programming community and the ASF in the middle of a corporate war.

Oracle may succeed in the short term in getting some more money, but the forces propelling open communities will route around this issue and when that happens, this will undoubtedly come back to bite Oracle on the hand.

Abuse of power has happened before and will happen again.
Loss of that power has happened before and will happen again.
No matter who. No matter why.

Disclosure: i'm not active in any open source community.

This is enough to be my fifth post.

PS: photographs are from stock.xchng and iStockPhoto

2010-10-12

(not) Messing up your Data Model because of your ORM

A few months ago, one of my clients asked me to take a look at the Data Model of one of his applications because of performance problems. A review showed that the database tables had no indexes, which was the cause of the performance hit.

Yet a deeper and more complex structural problem surfaced upon further inspection. This second problem was, and still is, the result of a lack of knowledge of both Databases and ORMs, so i thought it was worth sharing.

Refactoring an application usually isn't a big deal by itself. But when you have a Database involved, refactoring turns into a very complex problem beyond the realm of a single programmer, because data needs to be restructured. Moreover, when the data comprises millions of records and over a terabyte of storage, the restructuring process can take months, making it impractical and even forbidden by laws and regulations, as it is in this particular case.

The Application and the Implementation

The troubled application stores transactions in a database. It consists of a web service writing to the database and a web interface to query inserted records. Transactions are composed of primary meta-data and three optional groups of data: Extra meta-data, Business data and LowLevel data. Each of the four data groups is stored in its own table. Tables hold historic information because it must be available for querying at any time and because regulations require it.

When development started, an Object Model was created for the write web service that could be described as:
Four classes to model a transaction:
  • class Transaction: holds primary meta-data
  • class Extra: holds extra meta-data
  • class Business: holds business data
  • class LowLevel: holds low level data
These classes are POJOs, and there is also a DAO class to abstract database access. The classes Extra, Business and LowLevel are attributes of the class Transaction.
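
A minimal sketch of that Object Model could look like the following; the class names come from the description above, while the fields shown and the DAO signature are hypothetical:

// Sketch of the Object Model described above (shown together; each class would live in its own file).
public class Transaction {
    private Long id;
    private Extra extra;        // optional extra meta-data
    private Business business;  // optional business data
    private LowLevel lowLevel;  // optional low level data
    // primary meta-data fields, getters and setters omitted
}

class Extra    { private Long id; /* extra meta-data fields */ }
class Business { private Long id; /* business data fields */ }
class LowLevel { private Long id; /* low level data fields */ }

// The DAO that abstracts database access (hypothetical signature).
interface TransactionDao {
    void save(Transaction transaction);
}
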
This is where things started to run amok, and where reality and implementation started to diverge.

Note: remember Euclidean Geometry? What happens when even a very small divergence between two straight lines is propagated over a very large distance?

Let's first see what the Good Data Model should look like:

The Good Data Model
As the image shows, there should be a table named Transaction, with a monotonically increasing primary key (an auto-incremented id, mapped for example to an Oracle Sequence or an MS SQL Server IDENTITY). There should also be three satellite tables (Extra, Business and LowLevel), each with just a primary key that takes its value from the Transaction primary key and, if you feel like it, a foreign key to the Transaction primary key. Notice that this design preserves the fact that satellite tables are dependent upon the primary table. This design is guided by what is called Normal Form in Database courses at colleges.

However, the implemented Data Model, which we will call Bad, looks like the following:

The Bad Data Model
As the image shows, now each table has an IDENTITY primary key. There are also three attributes on the Transaction table, each having the value of the primary key for the record in the corresponding satellite table.

I believe two separate things conspired to make the programmer create this Bad Data Model:
  • The Object Model makes the four classes (types) independent of each other. An instance of each class can exist on its own, as you can call operator new on that class. Even though this makes syntactic and semantic sense for the Object Model, it doesn't reflect reality and it makes neither syntactic nor semantic sense for the Data Model.
  • Not knowing the full set of capabilities of the Hibernate ORM. When asked, the programmer's answer was that Hibernate cannot map things like the Good Data Model, even though it made sense from the Database point of view. Of course, Hibernate can properly map the right Data Model (see the One-to-One bidirectional association in the Hibernate docs, especially the relation between the Person and PersonAddress tables); a mapping sketch follows this list.
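
To make that concrete, here is one way such a shared-primary-key mapping could be written with JPA 2.0 style annotations (the XML one-to-one mapping from the Hibernate docs achieves the same thing); the class and property names follow the model above, the rest is an assumption:

import javax.persistence.*;

// Sketch: one-to-one with a shared primary key (Good Data Model).
// Shown together for brevity; each entity would live in its own file.
@Entity
public class Transaction {
    @Id
    @GeneratedValue                     // maps to a Sequence or IDENTITY
    private Long id;

    @OneToOne(mappedBy = "transaction", cascade = CascadeType.ALL)
    private Extra extra;                // Business and LowLevel map the same way
}

@Entity
class Extra {
    @Id
    private Long id;                    // same value as Transaction.id

    @MapsId                             // primary key derived from the association
    @OneToOne
    @JoinColumn(name = "id")
    private Transaction transaction;
}

With a mapping like this, the ORM inserts the Transaction row first and reuses its generated id as the primary key of the satellite rows, which is exactly what the Good Data Model needs.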

The implemented Bad Data Model created two very serious structural problems. Let's try to understand the how and why of these problems.

Temporal Causality no more

The implemented Bad Data Model has one very important side-effect related to record ordering.

Given the structure of the Bad model, and for performance reasons, the insert sequence looks like the following (this is just mockup code):

// create a transaction
hibernateTrans.create();

// insert into the ***SATELLITE*** tables
hibernateTrans.save(extra);
hibernateTrans.save(business);
hibernateTrans.save(lowLevel);

// set the satellite tables' ids in the main record
trans.setIdExtra(extra.getId());
trans.setIdBusiness(business.getId());
trans.setIdLowLevel(lowLevel.getId());

// save the main record
hibernateTrans.save(trans);

// commit the transaction
hibernateTrans.commit();

As you can see, the insert order is reversed compared to what common sense dictates: Transaction then satellites versus satellites then Transaction. This is because the Transaction table needs to store the references (IDs) to satellite tables and those are only available after the records are inserted. If the order is not reversed, then a later update must be executed on the already inserted Transaction record.

As long as you execute in a single threaded environment, such as development, you can guarantee that for two consecutive transactions T1 and T2 (and the satellite tables) the invariant "T1.id < T2.id implies T1.Business.id < T2.Business.id" holds true; the same can be said for the other satellite tables. The invariant provides temporal causality across tables of the Data Model. It does so because, as IDs are monotonically increasing, you can safely infer time causality on a table: a lower id means the insert happened before.

However, in a production environment, with the inserts running on multiple threads, that is no longer the case. To make it clear, imagine that transaction T1 is taken by thread T1 and that transaction T2 is taken by thread T2. The following image shows a possible sequence of execution:

Possible execution flow

The green arrow marks the execution flow between the two threads (the horizontal lines are thread switches). You must remember that the selected ID for each inserted record is decided by the Database, based on the last used ID for each specific table.

At the base of the image you can see the list of IDs assigned to each table: Transaction, Extra, Business, and LowLevel. As you can see, each list is a mixed set of values because the records are inserted in different order. Thus, the invariant does not hold.

This seemingly unimportant fact has very deep implications for a Database server, because it is common practice that records are physically stored on disk in Primary Key order. Thus, when you JOIN rows from the Transaction table with one of the satellites, records on both tables will not be in the same order, forcing the Database to perform the extra step of ordering. This ordering step can be executed: 1) in memory (consuming a lot of resources); or 2) on the fly by making disk seeks for each matching record (not cache friendly).

Now, contrast this with what happens with the Good Data Model. The sequence is to insert the Transaction record first and get the ID (primary key value). Then, the same ID is used to insert into the satellite tables. This preserves the temporal causality invariant and forces the same order on all tables, thus enabling the database server to read records stored in the same physical order, in forward-only mode. This is very disk cache friendly and supports data read-ahead and data-prefetch algorithms.
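
In the same mockup style as the code above, the Good Data Model insert sequence could look like this (the explicit id propagation is shown for clarity; with a shared-primary-key mapping the ORM can do it for you):

// create a transaction
hibernateTrans.create();

// save the ***MAIN*** record first and get its generated id
hibernateTrans.save(trans);

// reuse the Transaction id as the primary key of the satellite records
extra.setId(trans.getId());
business.setId(trans.getId());
lowLevel.setId(trans.getId());

// insert into the satellite tables
hibernateTrans.save(extra);
hibernateTrans.save(business);
hibernateTrans.save(lowLevel);

// commit the transaction
hibernateTrans.commit();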

Note: for those of you that have the "concurrent and parallel programming" gene, perhaps i should make explicit the fact that losing temporal causality is due to the lack of a "critical section" primitive when accessing the database. While the "get the next id" operation is atomic, you must realize that the Begin/Commit/Rollback transaction construct only guarantees that the sequence is executed completely or not at all; it says nothing about executing undisturbed. In fact, for many versions of MS SQL Server a Rollback doesn't undo the "get next id" associated with an Identity field, making it forever consumed. I think a rollback doesn't undo an Oracle Sequence either.

Query optimization is severely affected

The second structural problem is perhaps more important, because it deeply affects the query optimizer's ability to change the query evaluation order. The best way to understand why is to think in terms of solving the following query:
SELECT E.field1, B.field2
  FROM Transaction T
  INNER JOIN Extra E
    ON T.idExtra = E.id AND /* conditions on Extra */
  INNER JOIN Business B
    ON T.idBusiness = B.id AND /* conditions on Business */

Note: the JOIN conditions for the Good Model will be different. For example, the Extra table JOIN will have T.id = E.id as condition instead of T.idExtra = E.id.

Despite many differences between specific database products, we can assume a generic database server will solve this by computing one partial temporary result (subquery) for each table (Transaction, Extra and Business). The server will then proceed to calculate the intersection between all the matching temporary results. In Relational Algebra this matching is called a Natural Join and can be thought of as a set intersection (∩) or and operator. This last step can then be represented by the expression: Transaction and Extra and Business.

A lot of the biggest optimizations happen when the server calculates this intersection. To understand how this happens and why, it is handy to think of a set of functions that relate the different tables of the Data Model. These functions take a single parameter (the Primary Key in the source table) and return the Primary Key of the associated record in another table. The next image represents these functions:

List of Functions to relate tables in the Data Model

Note that there is a pair of functions between each pair of tables. For example, between the Transaction and Extra tables there are E(t) ⇒ e and the inverse Ei(e) ⇒ t. There are also relations between the satellite tables of the model. For clarity, the image shows only the relation functions between the Extra and Business tables. Let's first examine the functions E and Ei.

For the Bad Model, E(t) is very simple. Just return the field T.idExtra (see the JOIN condition of the example query). The inverse Ei(e) is more complex. You start with the primary key of the Extra record (e) and then you must find the primary key of the associated Transaction record. To do that you must either: 1) read every Transaction record to find the one with idExtra = e, or 2) create an index on the idExtra field. The index is obviously faster, but it still requires some disk accesses.

For the Good Model, the function E(t) is also simple. Just return the same t (remember JOIN conditions are different). The inverse Ei(e) is equally simple. Just return the same e. No extra indexes, no extra disk accesses. Just the identity function, as expressed by the JOIN conditions.

Now, just for fun, let's try to go from the Extra table to the Business table. The database will have to synthesize the functions EtoB(e) ⇒ b and BtoE(b) ⇒ e. EtoB(e) can be written as a function composition: EtoB(e) = B(Ei(e)). That is: given e, find the t that matches, and then with that t, find b. Similar steps must be taken to create BtoE(b).

For the Bad Model, Ei requires an index search, while B just returns the idBusiness field of the specific Transaction record; but that means the database must read the Transaction record that Ei(e) points to. For the Good Model, we know that both Ei and B are the identity function, so B(Ei(e)) = Identity(Identity(e)) = Identity(e) = e. No reads, no index searches. Again, just the identity function.
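
Just to make the composition argument concrete, here is a toy sketch (hypothetical code, not part of the application) that models the Good Model relation functions as plain Java functions; composing identities yields the identity, which is why no lookups are needed:

import java.util.function.LongUnaryOperator;

// Toy illustration: in the Good Model every relation function is the identity,
// so EtoB(e) = B(Ei(e)) collapses to the identity as well.
public class RelationFunctions {
    public static void main(String[] args) {
        LongUnaryOperator Ei = e -> e;              // Ei(e) => t (identity)
        LongUnaryOperator B  = t -> t;              // B(t)  => b (identity)

        LongUnaryOperator EtoB = B.compose(Ei);     // EtoB(e) = B(Ei(e))
        System.out.println(EtoB.applyAsLong(42L));  // prints 42: no index search, no read
    }
}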

Let's go back to our example and put all this into practice. Imagine the subqueries get executed and we get 500 results from table Extra and 1100 results from table Business, with a total of 1 million records in table Transaction (no conditions on Transaction in our example query). To compute Transaction and Extra and Business, our generic database will do one of the following:
  • Bad Model without indexes: it must go through each of the 1 million records of Transaction to be able to match the Extra and Business records, because the relations E(t) and B(t) go in one direction and Ei(e) and Bi(b) require scanning the whole Transaction table (a "table scan"). To find a matching record by the field idExtra or idBusiness you would have to perform a table scan anyway, so starting with table Transaction is the same. This means at least 1 million Transaction reads, plus the Extra and Business record sorting and matching.
  • Bad Model with indexes: for each of the 500 results from table Extra, it will have to do an index search to find the associated Transaction record to be able to match the corresponding Business record. Furthermore, the records on Business will not be in the same order as the records on Extra, so this requires a disk seek for each record. This means: 500 index searches on Transaction, 500 Transaction reads, 500 seeks on Business.
  • Good Model: because of the identity function, the server knows that it can safely compute (Transaction and (Extra and Business)) without altering the results. That is: it can consider the intersection operation to be associative. Calculating the Extra and Business part involves a forward-only record match between 500 and 1100 records (same key on both tables, therefore same order). The records on table Transaction will still have to be read, because it is the table in the FROM clause and not checking primary key existence on Transaction would break the query semantics. This all means: a forward-only match of 500 and 1100 records (no internal sorting) and 500 forward-only reads on Transaction.

The Bad Model breaks associativity because the relational function to go from one table to the next is not the identity function. The Bad Model with indexes is not as bad, but still causes a lot of disk seeks (the most expensive operation on disks). The Good Model preserves associativity and that enables the server to generate huge savings in disk seeks and reads. Database implementations have many other optimizations for identity key match, hence that kind of match should be used whenever possible.

The fix that is not

The fix must be software only and it must involve both components of the application.

First, the web service that inserts records must recreate the Object to Data mapping to use the Good data model. The relation fields idExtra, idBusiness and idLowLevel can be left blank (NULL value). Foreign Keys, if present, must be dropped from the Database.

Second, the web application that queries records should be modified to be able to use the two data models, Good and Bad. The selection of the model to use is based on a system-wide property that marks the moment in time when the Good data model began to be used. Because queries can span both models, a method of joining the results of querying both models in parallel must be implemented. The querying application is also able to do some basic reporting by returning aggregate counts. This reporting has to be changed to span both data models. Think about the complexity of calculating averages crossing different timespans.
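
As a rough illustration only (every name here is hypothetical, and the two queries are shown sequentially although they could run in parallel), the model selection and merge could be sketched like this:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical sketch: none of these names come from the real application.
interface TransactionQueryDao {
    List<Object[]> findBetween(Date from, Date to);   // one implementation per data model
}

public class TransactionQueryService {
    private final Date cutover;                 // when the Good data model began to be used
    private final TransactionQueryDao badDao;   // queries the Bad (legacy) data model
    private final TransactionQueryDao goodDao;  // queries the Good data model

    public TransactionQueryService(Date cutover, TransactionQueryDao badDao, TransactionQueryDao goodDao) {
        this.cutover = cutover;
        this.badDao = badDao;
        this.goodDao = goodDao;
    }

    // Queries spanning the cutover must hit both models and merge the results.
    public List<Object[]> findBetween(Date from, Date to) {
        List<Object[]> results = new ArrayList<Object[]>();
        if (from.before(cutover)) {
            results.addAll(badDao.findBetween(from, to.before(cutover) ? to : cutover));
        }
        if (to.after(cutover)) {
            results.addAll(goodDao.findBetween(from.after(cutover) ? from : cutover, to));
        }
        return results;
    }
}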

These changes are a full-blown project on their own, but the biggest problem is that future changes and feature additions to the application will duplicate the testing effort, as they must be tested on both models.

A second alternative would be to create a new version of the application using the Good model and not provide for model coexistence. With this approach, new installs will benefit, but your existing customers will be stuck with an unsupported version. This is not commercially viable, as this option implies supporting two versions of the same application forever, duplicating development and testing.

A third option was briefly considered but quickly discarded. It involved sending a T-101 to the past.

The fix is that no fix will ever be applied. Data cannot be changed and Maintenance Costs cannot be increased. The problem is assumed to exist forever.

Conclusion

There are a number of conclusions:
  • Data Model and Object Model are two different things for a good reason. If you decide to force one onto the other, it better be a careful and informed decision.
  • Databases and ORMs are very complex tools; sometimes as complex as programming languages. It is your responsibility as a programmer to properly understand them. ORMs have been widely criticized or blindly used or both. SQL Databases have recently started to draw heavy criticism too.
  • Local properties (Atomicity) do not imply global properties (Temporal Causality).
  • Once your application or product is deployed in a production environment, course corrections in the form of refactoring are not always possible. Many times logic is free to evolve while data is anchored to past versions of that logic.
  • In this particular case, the insert web service was programmed months before the query application, and by a different programmer. This translated into little initial engineering effort and set a bad course that is now impossible to correct.

This is long enough to be my fourth post.

PS: many thanks to Alejo Sanchez for reviewing early drafts of this post and making so many suggestions to make it better.

2010-09-07

Exception madness

Before exceptions became mainstream technology in programming languages (about a decade and a half ago, i think), error control in programs was a delicate matter.

The problem was due to a common design pitfall that plagued (and still plagues) many technologies. That pitfall is named: in-band signaling.

Note: the name "in-band signaling" comes from the telecommunications industry (see the Wikipedia entry). The term comes from the fact that when you dial a number, the number itself is sent as sound over the line (you hear the tones, don't you?). In-band signaling doesn't seem to be a problem, until you realize that other functions of the telecommunication network work the same way. Thus, by knowing the right code, you can just dial it on your phone and, lo and behold, the next call is not billed, and things like that. That used to happen a lot in the '70s, '80s and early '90s, because communications (modem calls included) were expensive back then. I think it doesn't happen much today because once you pay for Internet access, the world is right at your fingertips.

Just think about it. If you don't have the exception infrastructure, then the exceptions must be returned as a special form of result from your method or function.

In languages like C, it is very common to see things like:

#include <errno.h>
#include <stdio.h>

FILE *file = NULL;
if ((file = fopen("/some_file", "w")) == NULL) {
    /* oops: check errno to see what went wrong */
}

Notice that the function fopen is supposed to return a file handle (of type FILE *), but if an error occurs, then NULL is returned and you have to check the errno variable to see what went wrong.

Do you see that the error is sent in-band with the result?

This caused an awful lot of problems in many programs, even to the point of forcing the use of the most obnoxious pair of library calls in the C environment: setjmp and longjmp.

What did these two calls provide? In plain English: the ability for a program to set(jmp) a recovery point where bad situations could be handled, and the ability for that program to (long)jmp to that point when bad things happened. I have to say that you ought to be a terrorist to use these two function calls, so controlling for errors in-band infected regular programming as badly as a venereal disease.

Perhaps you recognize that jumping back to a place where you know how to handle errors is just a tiny part of what the exception infrastructure provides in modern programming languages; there is a lot more to exception handling than just the jump. The absence of that "lot more" is what made the pair setjmp/longjmp almost impossible to use properly. These jumps worked by just instantly moving the execution to a previous program scope, whereas exceptions unwind the intermediate scopes (meaning: cleaning up stack objects and, for languages without Garbage Collection, giving a chance to destroy manually allocated objects).

All of a sudden, with exceptions you can write much cleaner code. If, for instance, fopen threw an exception when the file cannot be opened, the following code would make a lot of sense:

try {
    FILE *file = fopen("/some_file", "w");
    // do something with the file
} catch (FileNotExistsException ex) {
    // handle error
};


However...

Bad things happen

In our case, 3 bad things to be exact.

The first bad thing is that the code to handle an exception gets separated from the code originating it. This actually increases complexity in our programs. Just consider our last example source code and imagine that the code implied by the line // do something with the file is actually 150 lines long (or even 15 lines long). Then, which line generated the exception handled by the catch clause is not immediately obvious. In this case, the exception name gives a clue, but with more obscure exception names the relation is not evident, just implicit, and that adds complexity. You can of course put a specific try/catch pair around each call that can generate an exception, but that makes for bigger code. And we have not even considered the finally construct, which is expected to be used to undo object creation side-effects. A comment stating the source instruction for each catch clause will help with this issue, but it will not solve it, and it is extra programmer effort.

This leads to the second bad thing. Given that properly writing exception handling code is an extra effort, many programmers started to do one of two things: 1) put in a generic catch (Exception e) clause, or 2) just not handle exceptions at all. The first case loses the chance for fine-grained error handling: any single exception completely aborts the function or method.

The second case degenerates into the third bad thing, especially for web services or web applications.

Enter the third bad thing. Considering bad thing number one and bad thing number two, proper exception handling is complex, so when we have a number of people writing code, the only thing we can count on is that exceptions will not be properly handled. But don't despair. We can put a catch (Exception e) at the highest levels of our app and, at worst, use an all-fucked-up generic error page. Even better, we can use the exception class to select a proper message to display, so it works like a charm.

There is nothing particularly bad with that implementation; i even like it and use this idea, but only if it is not a replacement for proper error handling.
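
As a sketch of the idea (the names are hypothetical, and a servlet filter is just one possible place to hang this safety net):

import java.io.IOException;
import javax.servlet.*;

// Last-resort handler: a safety net, not a replacement for proper error handling.
public class LastResortExceptionFilter implements Filter {
    public void init(FilterConfig config) throws ServletException { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        try {
            chain.doFilter(req, res);
        } catch (Exception e) {
            // Use the exception class to select a proper message, then show a generic error page.
            req.setAttribute("errorMessage", messageFor(e));
            req.getRequestDispatcher("/error.jsp").forward(req, res);
        }
    }

    private String messageFor(Exception e) {
        if (e instanceof IllegalArgumentException) return "The data you sent is not valid.";
        return "Something went wrong, please try again later.";
    }
}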

The real (unintended) problem with this approach is that, when uncontrolled, it fosters...

Exception Madness

Why is that? Well, when programmers can count on having a safety net below, they tend to get lazy (we tend to get lazy). Now, you can just throw exceptions for almost any code you don't want to write.

For example, let's say you have an input field in a form that needs to be even (divisible by two). Some programmers (sadly not a few) will write a utility function or method called even that receives a generic object or integer and returns a boolean (true) if the value is even, but throws an exception if the value is odd. Of course, if you consider the method name, there is no good reason for it to throw anything other than InvalidParameterType if it accepts general objects as input; but if you are going to throw your stomach contents onto your caller, at least have the decency to call the method something like ThrowsIfNotEven.
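
Here is a sketch of that anti-pattern next to a saner version; the class and method names are made up for illustration:

public class EvenCheck {
    // The anti-pattern: a query method that throws instead of answering.
    public static boolean even(int value) {
        if (value % 2 != 0) {
            // deciding what to do about odd input is the caller's business, not ours
            throw new IllegalArgumentException("value is odd");
        }
        return true; // the boolean result is now meaningless: it is always true
    }

    // The sane version: answer the question and let the caller decide.
    public static boolean isEven(int value) {
        return value % 2 == 0;
    }
}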

These kinds of situations are doubly perverse: first because they abuse the last-resort exception handling, and second because they do not put business or application logic in the right place (if you are going to abuse, then the caller should throw, not the utility function).

Conclusion

Exceptions are actually a step forward in error handling, a step i am glad was introduced into many programming languages; but i think the problem is that, once again, we got the silver bullet syndrome with them. Exceptions are an excellent tool, but only to the extent that they are used properly, and they have some side effects that you should be aware of.

Enough for a very very very late third post. Sorry about that.