Distributed Transactions and Commit Order Assumptions

September 28, 2008

We have quite a few applications that use, what amounts to, a hand rolled implementation of an ESB.

Typically, the system design includes several JMS queues which dispatch to a command chain for processing. The usage of the chain of responsibility pattern allows us to loosely couple components and processing logic.

In its simplest form, the system looks something like this:

Messages are received on the gateway and put on the associated JMS queue. The gateway can support a variety of pluggable protocols for inbound and outbound messaging(e.g. HTTP-REST, HTTP-WS, FTP, and even custom socket/udp based protocols). These gateway plugins are comparable with JBI Binding Components. Messages received by the inbound queue listener are dispatched to a processing chain which typically performs a variety of processing steps (which usually includes persisting some data to the database). At the end of the processing chain, a message is usually put on the reciprocal JMS queue constituting the reply to the original message.

The major difference between our application design and a canonical ESB is that our messages are persisted in the database while they are in flight and a unique identifier (e.g. PK) for the message stored in the database is passed through the system. An ESB, on the other hand, uses message passing semantics where the ENTIRE message is passed along.

Why is this distinction important?

Because the entire message is continually passed along between components in an ESB (via the Normalized Message Router), the receiver (e.g. JBI Service Engine) always has an up to date ‘view’ of the message. This is not always the case in our system and that can cause problems if we aren’t prepared for it.

We use an XA transaction to manage the JMS receive, any affiliated database updates and the terminating send of the message at the end of the processing chain. It’s easy, however, to misinterpret WHAT the XA semantics guarantee with HOW they operate in practice.

Consider the case in which we store a message in the database and send the PK for the newly inserted message to another JMS queue for subsequent processing. When the transaction commits, the transaction manager will eventually invoke commit() on each participant in turn. What is important to remember, however, is that the specification has no guarantees about WHEN the participants will be committed, only that when the transaction is complete, the state will eventually be consistent to an outside observer.

What if immediately after the transaction manager invokes commit() on the jms resource, the thread of execution is suspended? The receiver listening on that queue may very well receive the message and start processing it before the transaction manager ever even invokes the commit() on the database resource. As such, in typical transaction isolation semantics of Read Committed, this record will not be visible yet to the JMS receiver because it has not been committed yet by either the transaction manager or the underlying database resource. Assuming that the absence of the record indicated by the PK in the message is an unrecoverable error would be a programming error. It should be expected that this will occur and the system should allow for the message delivery to be rolled back and reattempted at some later date (when the database resource is synchronized with the rest of the system).

It is easy to be lulled into complacency in this scenario, because IN PRACTICE, you will rarely ever see the dreaded “record not found” error. This is because, in most situations, the commit happens so quickly on all the resources that the database IS in sync before the JMS listener receives the message. Furthermore, if you are using a non-XA database resource and relying on your transaction manager to emulate XA by using some variant of the Last Resource Gambit, this will almost always mean that the transaction manager commits the database resource BEFORE the JMS resource. However, relying on this to be the case is dangerous because we’ve come across at least one transaction manager (Atomikos) for which this didn’t hold true.

In summary, remember that XA transactions guarantee you WHAT the state of the system will be but make no guarantees on WHEN that will occur. Failure to observe this distinction may cause you sleepless nights when your software is in production.


3 Responses to “Distributed Transactions and Commit Order Assumptions”

  1. Guy Pardon Says:

    This has been changed by now – Atomikos latest releases respect commit ordering if you disable multi-threaded 2PC.


    “However, relying on this to be the case is dangerous because we’ve come across at least one transaction manager (Atomikos) for which this didn’t hold true.”

  2. heuristicexception Says:

    Thanks for the update Guy. Duly noted!!

  3. Guy Pardon Says:

    In turn: thanks for pointing out the problem 🙂


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: