SQL Timebuckets

Home • Forum • RSS • PGP • Alerts • Links • (D)

Timebuckets in SQL

If you have an application that collects events of some kind, you're likely to want to see a timebased graph of those events sooner or later. In my application, I'm receiving messages from clients over HTTP through several servers ("front-ends"). To see how well the front-ends handle the load, I want to be able to present a graph over how many messages are processed per front-end per time interval. That time interval can be seconds, minutes, hours, days, weeks, months or years or almost any intermediate period. The user should be able to select the resolution and the span and get that graph on to the screen in a reasonable amount of time. If that time was short enough, and the load on the server low enough, I could even present a real-time scrolling graph. Another requirement is that collecting the data should not cause much extra load as the messages are coming in through the front-ends.

The database needs to retrieve the event times as "time-buckets". A "time-bucket" is a time interval and the count of events within that interval. The size (time length) of a bucket is set by the user in the query and the query returns a rowset with the intervals and counts. It should all be lightning quick, of course, and not worry the CPU much. I tried a number of methods that didn't work, but I'll summarize them anyway so you know what not to try.

Make buckets during select

Just construct the data with a "GROUP BY" statement and some function similar to the fTimeKey() function (outlined in the next section). This will cause a temporary table to be created by the SQL processor and a very high load on the CPU. If we want to pick up performance data on a continuous basis, such as every 10 seconds, this will be totally unworkable. Simply forget I even mentioned it.

Bucketing while getting messages

One possibility is to provide the table that keeps the messages with an INSERT trigger that updates a counter for the smallest possible bucket interval, or for several intervals. This trigger will need to call a stored procedure similar to the following:

IF EXISTS(SELECT * FROM TimeBuckets
           WHERE Host= @Host
             AND TimeKey = dbo.fTimeKey(@DateTime))
   INSERT TimeBuckets (TimeKey, Host, Cnt)
   VALUES (dbo.fTimeKey(@DateTime), @Host, 1)
ELSE
   UPDATE TimeBuckets
      SET Cnt = Cnt + 1
    WHERE TimeKey = dbo.fTimeKey(@DateTime)
      AND Host = @Host

The dbo.fTimeKey() user function (you can only create user functions in MSSQL 2000 and later) returns a string that depends on the size of the timebuckets you want. For instance, if you want time buckets holding a full second of data, the returned key could be: "20040220 113108". If you want buckets the size of minutes, the string for the same datetime value would be: "20040220 113100", i.e. all times during that minute would be rounded off to the same key value.

There are a number of problems with this approach. The first problem is the abominable performance. If you've got anything like a realistic load on those front-ends, this this is unworkable. Just try it and you'll see. The second problem is that you have to decide on the timebucket size beforehand. So if you want to retrieve data in larger timebuckets, you have to do collate and sum the data in the query. That, in turn, will be horrible. Or, finally, you could extend the stored procedure to post to different sized buckets during the initial write. And that is even worse (I tried).

Other variations on the above theme, such as inserting all the needed buckets with counts = 0 first, then entering the count values using a (hairy) correlated subquery don't improve things noticeably. I don't doubt for a second that this approach can be improved considerably, but I don't think it ever will be good enough to be practical.

Bucketing after the fact

You can apply the same principle as above, but disconnect it from the initial saving of data from the front-ends. That would liberate the front-ends to run at their full capacity and would allow you to sort the event datetimes into buckets when the server has spare capacity. This method would imply that you need to maintain datetime "milestones", that is, values showing how far into the event tables you've processed the data. You also would need a mechanism to run processing in the background. I'm using an autostarted procedure in the master database for these things, that contains a WAITFOR statement and that calls a stored procedure in my database once a second. I could have that procedure check for new events to put into buckets.

Problems (and this one has more than enough): there's no way I know of to have this stored procedure recognize that the server is a bit too busy to bother with bucketizing right now. Or to have a stored procedure run with a lower priority. If anyone knows, please tell me about it. So this procedure may still very well take processor capacity when we really can't afford it. Maintaining the milestones isn't trivial either. Selecting from the event tables according to the milestones may be very inefficient (we need the last second of data from a table containing maybe years of data...), so we may need to have separate small tables for unprocessed data. And finally, if we select data for a graph, we won't get the most recent data every time.

It's really simple, actually

Since nothing worked, I started over. As simple as possible, meaning playing with indexes. I created a table containing nothing but the events themselves:

CREATE TABLE [dbo].[TimedEvent] (
   [Event] [char] (16) NOT NULL ,
   [Host] [char] (16) NOT NULL ,
   [DateTime] [datetime] NOT NULL ,
   [YYYY] AS (datepart(year,[DateTime])) ,
   [MM] AS (datepart(month,[DateTime])) ,
   [DD] AS (datepart(day,[DateTime])) ,
   [HH] AS (datepart(hour,[DateTime])) ,
   [NN] AS (datepart(minute,[DateTime])) ,
   [SS] AS (datepart(second,[DateTime])) ,
   [DW] AS (datepart(weekday,[DateTime]))
)
CREATE CLUSTERED INDEX [IX_TimedEvent] ON [dbo].[TimedEvent]([YYYY], [MM], [DD], [HH], [NN], [SS])

This table has an "Event" column where you can stuff things like "Received", "Encrypted", "Sent" or whatever classes of events you have that you want to keep tabs on. The "Host" column contains the particular front-end server I want to have data on. "DateTime" contains the exact datetime of the event. The really important thing happens in the [MM] ... [DW] columns. These are all defined as computed columns, so they're automatically filled in by the system. As you can see, they contain the different parts of the datetime, including a day of the week column ([DW]).

Then I added a single clustered index, composed of the series of computed columns, except for the [DW] column. And finally, I changed the stored procedure that writes new data for the front-ends to also insert a row into the TimedEvent table for every event. Alternatively, if you have only one or a few event tables you want to track, you could add the computed columns directly to that table itself.

There's one disadvantage with the above computed columns: all computed columns are "integer" columns, since that is the return value type of the DATEPART() function. If you have millions of rows, the wasted space is becoming significant. I'm going to change the above to calculations done in an INSERT trigger and change the column types to the smallest possible (tinyint or byte, for example) in a later stage and then I'll see if it's worth it.

Performance: amazing

I ran queries against the TimedEvent table as it looks above, with 452990 rows of data in it. The machine running the SQL Server 2000 is a dual processor 800 MHz (it's a few years old) with 512 Mb RAM and a pretty good SCSI drive. I was running the queries using "SQL Query Analyzer" from a workstation connected to the server over a 100 MBit/s LAN. As timer, I used the timer in the status bar of the "SQL Query Analyzer".

In the first select, we retrieve the totals of the events per year. I have events spread over three years and four hosts. This select returns after less than one second:

SELECT COUNT(*), HOST
FROM TimedEvent
GROUP BY HOST, YYYY

The next select is very similar, but returns the total counts per weekday. Note that the table isn't even indexed by weekday, but still it takes less than a second (why, I don't know...):

SELECT COUNT(*), HOST
FROM TimedEvent
GROUP BY HOST, DW

The following select is much larger. It returns an ordered rowset containing all the counts per minute of all the servers separately. The returned rowset contained about 88000 rows and it took... 7 seconds to complete!

SELECT COUNT(*), HOST, YYYY, MM, DD, HH, NN
FROM TimedEvent
GROUP BY HOST, YYYY, MM, DD, HH, NN
ORDER BY HOST, YYYY, MM, DD, HH, NN

Finally, the following select returns all counts per minute of a 6 day period one month back in time. In my case, it returned 2541 rows and took less than one second to execute:

SELECT   COUNT(*), HOST, YYYY, MM, DD, HH, NN
FROM     TimedEvent
WHERE    DateTime >= GETDATE() - 30 AND DateTime <= GETDATE() - 24
GROUP BY HOST, YYYY, MM, DD, HH, NN
ORDER BY HOST, YYYY, MM, DD, HH, NN

To get an idea of the processing needed to write to the TimedEvent table, I filled the table using the following query:

INSERT TimedEvent (Event, Host, DateTime)
SELECT 'Ticket', ISNULL(Host, '-'), DateTime
FROM Tickets

This query took 30 seconds to insert 452990 rows of data. Which leads me to assume it won't be a problem...<g>.

Comment on this article

TOP