2010-10-12

(not) Messing up your Data Model because of your ORM

A few months ago, one of my clients asked me to take a look at the Data Model of one of his applications because of performance problems. A review showed that the database tables had no indexes, which was the immediate cause of the performance hit.

Yet a deeper and more complex structural problem surfaced upon further inspection. This second problem was, and still is, the result of a lack of knowledge of both Databases and ORMs, so I thought it was worth sharing.

Refactoring an application usually isn't a big deal by itself. But when a Database is involved, refactoring turns into a very complex problem, beyond the realm of a single programmer, because data needs to be restructured. Moreover, when the data comprises millions of records and over a terabyte of storage, the restructuring process can take months, making it impractical and even forbidden by laws and regulations, as it is in this particular case.

The Application and the Implementation

The troubled application stores transactions in a database. It consists of a web service that writes to the database and a web interface to query the inserted records. Transactions are composed of primary meta-data and three optional groups of data: Extra meta-data, Business data and LowLevel data. Each of the four data groups is stored in its own table. The tables hold historic information because it must be available for querying at any time and because regulations require it.

When development started, an Object Model was created for the write web service that could be described as:
Four classes to model a transaction:
  • class Transaction: holds primary meta-data
  • class Extra: holds extra meta-data
  • class Business: holds business data
  • class LowLevel: holds low level data
These classes are POJOs, and there is also a DAO class to abstract database access. The classes Extra, Business and LowLevel are attributes of the Transaction class.
This is where things started to run amok, and where reality and implementation started to diverge.

Note: remember Euclidean Geometry? What happens when even a very small divergence between two straight lines is propagated over a very large distance?

Let's first see what the Good Data Model should look like:

The Good Data Model
As the image shows, there should be a table named Transaction, with a monotonically increasing primary key (an auto-incremented id, mapped for example to an Oracle Sequence or an MS SQL Server IDENTITY). There should also be three satellite tables (Extra, Business and LowLevel), each with just a primary key whose value is the id of the corresponding Transaction record and, if you feel like it, a foreign key constraint back to the Transaction primary key. Notice that this design preserves the fact that the satellite tables are dependent upon the primary table. This design is guided by what Database courses at college call Normal Form.

However, the implemented Data Model, which we will call Bad, looks like the following:

The Bad Data Model
As the image shows, now each table has its own IDENTITY primary key. There are also three attributes on the Transaction table (idExtra, idBusiness and idLowLevel), each holding the value of the primary key of the record in the corresponding satellite table.
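
To make the description concrete, the Bad mapping boils down to something like the following sketch, written here with JPA/Hibernate annotations for brevity (the original application may well use hbm.xml mapping files instead; only the key-related fields are shown, and the field names are the ones used throughout this post):

import javax.persistence.*;

// each entity lives in its own source file
@Entity
public class Transaction {
    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private Long idExtra;     // primary key of the associated Extra record, or NULL
    private Long idBusiness;  // primary key of the associated Business record, or NULL
    private Long idLowLevel;  // primary key of the associated LowLevel record, or NULL

    // primary meta-data fields, getters and setters omitted
}

@Entity
public class Extra {
    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;          // independent key, unrelated to Transaction.id

    // extra meta-data fields omitted; Business and LowLevel look the same
}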

I believe two separate things conspired to make the programmer create this Bad Data Model:
  • The Object Model makes the four classes (types) independent of each other. An instance of each class can exist on its own, as you can simply call new on that class. Even though this makes syntactic and semantic sense for the Object Model, it doesn't reflect reality, and it makes neither syntactic nor semantic sense for the Data Model.
  • Not knowing the full set of capabilities of the Hibernate ORM. When asked, the programmer's answer was that Hibernate cannot map something like the Good Data Model, even though it made sense from the Database point of view. Of course, Hibernate can properly map the right Data Model (see One-to-One bidirectional association in the Hibernate docs, especially the relation between the Person and PersonAddress tables); a sketch of such a mapping follows this list.
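
For reference, here is a sketch of how the Good Model can be mapped with a shared primary key using Hibernate annotations. The class and field names follow the post, while the association property "transaction" and the generator name "sharedKey" are illustrative. Only the Extra side is shown; Business and LowLevel are mapped the same way, and the Transaction entity simply declares the inverse side, e.g. @OneToOne(mappedBy = "transaction") private Extra extra;

import javax.persistence.*;
import org.hibernate.annotations.GenericGenerator;
import org.hibernate.annotations.Parameter;

@Entity
public class Extra {
    // no independent IDENTITY here: the "foreign" generator copies the id of the
    // associated Transaction, so both rows share the very same primary key value
    @Id
    @GeneratedValue(generator = "sharedKey")
    @GenericGenerator(name = "sharedKey", strategy = "foreign",
            parameters = @Parameter(name = "property", value = "transaction"))
    private Long id;

    // one-to-one association joined on the primary key itself
    @OneToOne(optional = false)
    @PrimaryKeyJoinColumn
    private Transaction transaction;

    // extra meta-data fields, getters and setters omitted
    public Transaction getTransaction() { return transaction; }
    public void setTransaction(Transaction transaction) { this.transaction = transaction; }
}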

The implemented Bad Data Model created two very serious structural problems. Let's try to understand the how and why of these problems.

Temporal Causality no more

The implemented Bad Data Model has one very important side effect related to record ordering.

Given the structure of the Bad model, and for performance reasons, the insert sequence looks like the following (this is just mockup code):

// begin the database transaction (session is an open Hibernate Session)
org.hibernate.Transaction hibernateTrans = session.beginTransaction();

// insert into the ***SATELLITE*** tables first
session.save(extra);
session.save(business);
session.save(lowLevel);

// copy the satellite tables' generated ids into the main record
trans.setIdExtra(extra.getId());
trans.setIdBusiness(business.getId());
trans.setIdLowLevel(lowLevel.getId());

// save the main record
session.save(trans);

// commit the database transaction
hibernateTrans.commit();

As you can see, the insert order is reversed compared to what common sense dictates: Transaction then satellites versus satellites then Transaction. This is because the Transaction table needs to store the references (IDs) to the satellite tables, and those are only available after the records have been inserted. If the order is not reversed, then a later update must be executed on the already inserted Transaction record.

As long as you execute in a single-threaded environment, such as development, you can guarantee that for two consecutive transactions T1 and T2 (and their satellite records) the invariant "T1.id < T2.id implies T1.Business.id < T2.Business.id" holds true; the same can be said for the other satellite tables. The invariant provides temporal causality across the tables of the Data Model. It does so because, as IDs are monotonically increasing, you can safely infer temporal causality within a table: a lower id means the insert happened earlier.

However, in a production environment, with the inserts running on multiple threads, that is no longer the case. To make it clear, imagine that transaction T1 is taken by thread T1 and that transaction T2 is taken by thread T2. The following image shows a possible sequence of execution:

Possible execution flow

The green arrow marks the execution flow between the two threads (the horizontal lines are thread switches). You must remember that the selected ID for each inserted record is decided by the Database, based on the last used ID for each specific table.

At the base of the image you can see the list of IDs assigned to each table: Transaction, Extra, Business, and LowLevel. As you can see, each list is a mixed set of values because the records are inserted in a different order on each table. Thus, the invariant does not hold.
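
The interleaving is easier to reason about when written out, so here is a purely illustrative sketch of one possible interleaving (hypothetical ids; per-table AtomicLong counters stand in for the IDENTITY columns, and the interleaving is hard-coded rather than actually run on two threads):

import java.util.concurrent.atomic.AtomicLong;

public class InterleavingSketch {
    static final AtomicLong extraIds = new AtomicLong();
    static final AtomicLong businessIds = new AtomicLong();
    static final AtomicLong transactionIds = new AtomicLong();

    public static void main(String[] args) {
        long extra1 = extraIds.incrementAndGet();          // T1 saves Extra       -> id 1
        long extra2 = extraIds.incrementAndGet();          // T2 saves Extra       -> id 2
        long business2 = businessIds.incrementAndGet();    // T2 saves Business    -> id 1
        long business1 = businessIds.incrementAndGet();    // T1 saves Business    -> id 2
        long trans2 = transactionIds.incrementAndGet();    // T2 saves Transaction -> id 1
        long trans1 = transactionIds.incrementAndGet();    // T1 saves Transaction -> id 2

        // the Transaction with id 1 references Extra 2 but Business 1, while the
        // Transaction with id 2 references Extra 1 but Business 2, so the order of
        // Transaction ids no longer matches the order of the satellite ids
        System.out.printf("Transaction %d -> Extra %d, Business %d%n", trans2, extra2, business2);
        System.out.printf("Transaction %d -> Extra %d, Business %d%n", trans1, extra1, business1);
    }
}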

This seemingly unimportant fact has very deep implications for a Database server, because it is common practice that records are physically stored on disk in Primary Key order. Thus, when you JOIN rows from the Transaction table with one of the satellites, the records in the two tables will not be in the same order, forcing the Database to perform the extra step of ordering them. This ordering step can be executed: 1) in memory (consuming a lot of resources); or 2) on the fly, by making a disk seek for each matching record (not cache friendly).

Now, contrast this with what happens with the Good Data Model. The sequence is to insert the Transaction record first and get the ID (primary key value). Then, the same ID is used to insert into the satellite tables. This preserves the temporal causality invariant and forces the same order on all tables, thus enabling the database server to read records stored in the same physical order, in forward-only mode. This is very disk cache friendly and plays well with data read-ahead and prefetch algorithms.
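
To make the contrast concrete, here is a sketch of the Good Model insert sequence (mockup code again, assuming the shared primary key mapping sketched earlier):

// begin the database transaction (session is an open Hibernate Session)
org.hibernate.Transaction hibernateTrans = session.beginTransaction();

// save the main record first; the database assigns the Transaction id
session.save(trans);

// each satellite points back at the Transaction and, through the "foreign"
// generator, reuses that same id as its own primary key
extra.setTransaction(trans);
business.setTransaction(trans);
lowLevel.setTransaction(trans);
session.save(extra);
session.save(business);
session.save(lowLevel);

// commit the database transaction
hibernateTrans.commit();

Every table now receives its records in the same order, keyed by the same value.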

Note: for those of you that have the "concurrent and parallel programming" gene, perhaps I should make explicit that losing temporal causality is due to the lack of a "critical section" primitive when accessing the database. While the "get the next id" operation is atomic, you must realize that the Begin/Commit/Rollback transaction construct only guarantees that the sequence is executed completely or not at all; it says nothing about it executing undisturbed. In fact, in many versions of MS SQL Server a Rollback doesn't undo the "get next id" associated with an Identity field, so that id is forever consumed. I think a rollback doesn't undo an Oracle Sequence increment either.

Query optimization is severely affected

The second structural problem is perhaps more important, because it deeply affects the query optimizer's ability to change the query evaluation order. The best way to understand why is to think in terms of solving the following query:
SELECT E.field1, B.field2
  FROM Transaction T
  INNER JOIN Extra E
    ON T.idExtra = E.id /* AND conditions on Extra */
  INNER JOIN Business B
    ON T.idBusiness = B.id /* AND conditions on Business */

Note: the JOIN conditions for the Good Model will be different. For example, the Extra table JOIN will have T.id = E.id as its condition instead of T.idExtra = E.id.

Despite many differences between specific database products, we can assume a generic database server will solve this by computing one partial temporary result (subquery) for each table (Transaction, Extra and Business). The server will then proceed to calculate the intersection of all the matching temporary results. In Relational Algebra this matching is called a Natural Join and can be thought of as a set intersection (∩) or and operator. This last step can then be represented by the expression: Transaction and Extra and Business.

A lot of the biggest optimizations happen when the server calculates this intersection. To understand how and why this happens, it is handy to imagine a set of functions that relate the different tables of the Data Model. These functions take a single parameter (the Primary Key in the source table) and return the Primary Key of the associated record in another table. The next image represents these functions:

List of Functions to relate tables in the Data Model

Note that there is a pair of functions between each pair of tables. For example, between the Transaction and Extra tables there are E(t) ⇒ e and its inverse Ei(e) ⇒ t. There are also relations between the satellite tables of the model; for clarity, the image shows only the relation functions between the Extra and Business tables. Let's first examine the functions E and Ei.

For the Bad Model, E(t) is very simple: just return the field T.idExtra (see the JOIN condition of the example query). The inverse Ei(e) is more complex. You start with the primary key of an Extra record (e) and then you must find the primary key of the associated Transaction record. To do that you must either: 1) read every Transaction record to find the one with idExtra = e, or 2) create an index on the idExtra field. The index is obviously faster, but it still requires some disk accesses.

For the Good Model, the function E(t) is also simple. Just return the same t (remember JOIN conditions are different). The inverse Ei(e) is equally simple. Just return the same e. No extra indexes, no extra disk accesses. Just the identity function, as expressed by the JOIN conditions.

Now, just for fun, let's try to go from the Extra table to the Business table. The database will have to synthesize the functions EtoB(e) ⇒ b and BtoE(b) ⇒ e. EtoB(e) can be written as a function composition: EtoB(e) = B(Ei(e)). That is: given e, find the t that matches, and then with that t, find b. Similar steps must be taken to create BtoE(b).

For the Bad Model, Ei requires an index search, while B just returns the idBusiness field of the specific Transaction record; that means the database must read the Transaction record found through that index search. For the Good Model, we know that both Ei and B are the identity function, so B(Ei(e)) = Identity(Identity(e)) = Identity(e) = e. No reads, no index searches. Again, just the identity function.
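
Just to make the cost difference tangible, here is an illustrative sketch (not a real database: in-memory maps stand in for the index on Transaction.idExtra and for the Transaction table itself) of what EtoB(e) = B(Ei(e)) amounts to under each model:

import java.util.HashMap;
import java.util.Map;

public class RelationFunctions {
    // Bad Model: Ei(e) is an index search, B(t) is a read of the Transaction record
    static Map<Long, Long> indexIdExtraToTransactionId = new HashMap<Long, Long>();
    static Map<Long, Long> transactionIdToIdBusiness = new HashMap<Long, Long>();

    static Long badExtraToBusiness(Long e) {
        Long t = indexIdExtraToTransactionId.get(e); // Ei(e): index search
        return transactionIdToIdBusiness.get(t);     // B(t): read the Transaction record
    }

    // Good Model: both Ei and B are the identity, so their composition is too
    static Long goodExtraToBusiness(Long e) {
        return e;                                    // no reads, no index searches
    }
}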

Let's go back to our example and put all this into practice. Imagine the subqueries get executed and we get 500 results from the Extra table and 1100 results from the Business table, with a total of 1 million records in the Transaction table (there are no conditions on Transaction in our example query). To compute Transaction and Extra and Business, our generic database will do one of the following:
  • Bad Model without indexes: the server must go through each of the 1 million Transaction records to be able to match the Extra and Business records, because the relations E(t) and B(t) go in one direction only, and Ei(e) and Bi(b) require scanning the whole Transaction table (a "table scan"). Finding a matching record by the idExtra or idBusiness field would also require a table scan, so starting with the Transaction table costs the same. This means at least 1 million Transaction reads, plus the sorting and matching of the Extra and Business records.
  • Bad Model with indexes: for each of the 500 results from the Extra table, the server will have to do an index search to find the associated Transaction record in order to match the corresponding Business record. Furthermore, the records in Business will not be in the same order as the records in Extra, so this requires a disk seek for each record. This means: 500 index searches on Transaction, 500 Transaction reads, and 500 seeks on Business.
  • Good Model: because of the identity function, the server knows that it can safely compute (Transaction and (Extra and Business)) without altering the results; that is, it can treat the intersection operation as associative. Calculating the Extra and Business part involves a forward-only record match between 500 and 1100 records (same key on both tables, therefore same order). The records in the Transaction table will still have to be read, because skipping the primary key existence check on Transaction would break the query semantics. This all means: a forward-only match of 500 and 1100 records (no internal sorting) and 500 forward-only reads on Transaction.

The Bad Model breaks associativity because the relational function that goes from one table to the next is not the identity function. The Bad Model with indexes is not as bad, but it still causes a lot of disk seeks (the most expensive operation on a disk). The Good Model preserves associativity, and that enables the server to generate huge savings in disk seeks and reads. Database implementations have many other optimizations for identity key matches, so that kind of match should be used whenever possible.

The fix that is not

The fix must be software-only, and it must involve both components of the application.

First, the web service that inserts records must recreate the Object-to-Data mapping to use the Good data model. The relation fields idExtra, idBusiness and idLowLevel can be left blank (NULL). Foreign Keys, if present, must be dropped from the Database.

Second, the web application that queries records should be modified to be able to use both data models, Good and Bad. The selection of the model to use is based on a system-wide property that marks the moment in time when the Good data model began to be used. Because queries can span both models, a method of joining the results of querying both models in parallel must be implemented. The querying application is also able to do some basic reporting by returning aggregate counts; this reporting has to be changed to span both data models. Think about the complexity of calculating averages that cross different timespans.

These changes are a full-blown project on their own, but the biggest problem is that future changes and feature additions to the application will duplicate the testing effort, as they must be tested on both models.

A second alternative would be to create a new version of the application using the Good model and not provide for model coexistence. With this approach, new installs would benefit, but existing customers would be stuck with an unsupported version. This is not commercially viable, as this option implies supporting two versions of the same application forever, duplicating development and testing.

A third option was briefly considered but quickly discarded. It involved sending a T-101 to the past.

The fix is that no fix would ever be applied. The data cannot be changed and Maintenance Costs cannot be increased. The problem is assumed to exist forever.

Conclusion

There are a number of conclusions:
  • Data Model and Object Model are two different things for a good reason. If you decide to force one onto the other, it better be a careful and informed decision.
  • Databases and ORMs are very complex tools, sometimes as complex as programming languages. It is your responsibility as a programmer to properly understand them. ORMs have been widely criticized, blindly used, or both. SQL Databases have recently started to draw heavy criticism too.
  • Local properties (Atomicity) do not imply global properties (Temporal Causality).
  • Once your application or product is deployed in a production environment, course corrections in the form of refactoring are not always possible. Many times logic is free to evolve while data is anchored to past versions of that logic.
  • In this particular case, the insert web service was programmed months before the query application, and by a different programmer. This translated into little initial engineering effort and set a bad course that is now impossible to correct.

This is long enough to be my fourth post.

PS: many thanks to Alejo Sanchez for reviewing early drafts of this post and making so many suggestions to make it better.