Forum Chat


Mar23,13:31 Johan Marechal
Wees gegroet
Sep20,17:50 Vicente Duque
Kim, Martin, Others :...
Jul07,11:10 Johan Marechal
PGP 9
Jul05,21:13 martin
Fastest in the bush
Jul05,07:48 martin
Spamdexing
Jun28,21:16 martin
New domain / new blog!
Jun28,21:11 martin
On posting etiquette

Shadows and trails (Time Domain Addressing)

Comment on this article

This is a database pattern. I recently discovered a mention of something similar in Gray's Transaction Processing bible where it's named "Time Domain Addressing". Seems it was described in depth in 1978 by Dale Reed (or something similar) but not implemented in many databases as yet. Lots of problems to do it right, it seems. Since there was no in depth description, I can only assume my implementation is a small part of the whole. But a part that's good enough for me.

The problem (well, one of the problems) with a "straight" relational database is that it's always a timedependent snapshot. Example: you've got an invoices database with a customer table. Yesterday you change the address of one of the customers. Today you re-print an invoice from last month, and at the top of the invoice you see the new version of the address. In other words, there's no way you can re-print the invoice exactly as it looked when it was originally entered and printed a month ago.

Another example, more distressing this time. This morning, I get a lab result on patient X showing a hematocrit of 30% on a sample taken at 7 am (for you non-MDs out there, this is lowish), so I have the nursing staff give him a unit of blood. An hour later, the lab corrects that same result for the 7 am sample to 36% and even though I'm damned sure I actually saw "30%" on that screen earlier, now it says "36%" and I'm having a hard time convincing a collegue I actually did the right thing under the circumstances as they were then.

Of course, these kinds of problems can be solved piecemeal by having provisions in the respective tables for keeping a record of changes. But we're bound to provide such backup information only in the obvious places, not everywhere, unless we can set up a generalized mechanism for the entire database that takes care of this issue for all database entities. Can we? Yes we can, else this piece would have been pointless, right?

First, each table (and this means practically every table we've got in the database, except for tables that can never have updates or deletes) gets a second shadow-table with exactly the same structure as the "primary" table. Into this shadow table goes old versions of updated records and the most recent version of deleted records. The "primary" table, and the shadow table, both have a sequence number column added to be used for the trail system.

An extra table, the "trail" table, is added to the database. There's only one for the whole database. This table gains one row for each insert, update or delete on any other table in the system. A row here looks like:

Trail sequence nb Table number Primary sequence nb DateTime Session

 

Trail sequence nb A monotonously incrementing unique number
Table number: Identifies which primary table this row refers to. For instance "100", could be "invoices".
Primary sequence nb: The sequence number of the row in the primary table. 
DateTime Timestamps the entry. Generally useful to have.
Session Points to a record in a session table where the host, user etc can be held.

The trail list table is maintained by insert, update and delete triggers on the individual tables, making for performant implementation and the impossibility of applications to get around the system. Also, the applications need not be aware that the trail system even exists.

To illustrate how it works, let's go through an example. We'll add in "Albert", change his name to "Bertrand" and finally delete him. The trail table lacks the Session column since it's irrelevant to the example. Also, we assume "patients" to be table number 10. TSeq and PSeq mean Trail sequence nb resp. Primary sequence nb. DT means DateTime.

Step 1 - at 09:00 we add "Albert" to the system. 

Primary table Shadow table Trail
Key Value
1 Albert
Key Value
   
TSeq Tbl# PSeq DT
1 10 1 09:00

Step 2 - at 09:30 we rename "Albert" to "Bertrand".

Primary table Shadow table Trail
Key Value
1 Bertrand
Key Value
2 Albert
TSeq Tbl# PSeq DT
1 10 1 09:00
2 10 1 09:30

Step 3 - at 10:00 we delete "Bertrand" from the system.

Primary table Shadow table Trail
Key Value
   
Key Value
2 Albert
3 Bertrand
TSeq Tbl# PSeq DT
1 10 1 09:00
2 10 1 09:30
3 10 1 10:00

Note that any application that's oblivious to the system simply reads the primary table and chugs along happily. If you want to use the history data to find the name of a particular patient at any particular time, you can do that through a procedure working as follows (pseudo code):

sp_getPatOnDate(patKey, dtTime)
for each DT in Trail (PSeq = patKey)
  if (dtTime < DT)
    select from shadow where Key = TSeq
    return
  endif
endfor
select from Primary where Key = patKey
return

If you try this algorithm at noon (12:00) with different time values, you'll find:

dtTime returned remarks
08:45 NULL First entry in loop satisfies the condition and gives you TSeq = 1. Selecting for Key = 1 in shadow returns nothing.
09:15 Albert The second entry in the loop satisfies the condition and gives you TSeq = 2. Selecting for Key = 2 in shadow returns "Albert".
09:45 Bertrand The third entry in the loop satisfies the condition and gives you TSeq = 3. Selecting for Key = 3 in shadow returns "Bertrand".
10:15 NULL No entry in the loop satisfies the condition, so we drop through to the "select from Primary" which also returns nothing.

You can easily check that if the record had never been deleted, the last "select from Primary" would have picked it up correctly. The same goes for the situations where records have never been deleted or updated.

It's not hard to deduce the needed triggers and procedures from the above discussion. It's also easy to see that even though these triggers and procedures aren't trivial, they're very stereotypical and can easily be automatically generated for entire databases.

It's also blindingly obvious (to me, at least) that these operations belong in the database as triggers and stored procedures and not in a separate data access layer on the same or another machine. The reasons for this are several, including watertight control and performance.

There are several uses for this system:

  • History retrieval: an application can allow a user to pop down a list showing all changes to any value over time. (How to have the user find deleted values is more a user interface issue than a database issue.)
  • Rolling back the database to the exact state it had at any previous point in time. This can be implemented by views and/or stored procedures taking the "viewpoint time" as a parameter.
  • Replication of databases or exporting database updates: if any data export done includes the current highest trail sequence number, any updates only need to include operations found in later trail sequence numbers.

Comment on this article

TOP