Date formats

2013-04-12

Hilary Mason on twitter bemoaned the fact that matplotlib appears to use a floating point number of days to represent datetimes and suggested that “any other standard format” would be better. It is a little bit odd that a Python library uses that format, but it’s presumably because matlab does, and matplotlib is betraying its matlab heritage.

What other standards are there? Well, there’s POSIX time_t which is either an integer or a floating-point type (but actual practice seems to favour integer), and stores the number of seconds since the Unix Epoch, 1970-01-01T00:00:00Z. Not counting leap seconds. Not counting leap seconds is convenient for some calculations, but means there are (recent) times that cannot be represented (namely, any moment during a leap second). That’s not a good representation of time.

There’s another time format in use which is confusingly similar to POSIX’s time_t: it counts time using the number of seconds since the Unix Epoch. Including leap seconds. The opportunities for mistakes in conversion are endless. A problem with this time format is that times in the future (more than a couple of years into the future XX) are ambiguous. Since we don’t yet know if there are going to be leap seconds between now and 2020 we don’t know whether 2020-01-01T00:00:00Z is 1577836800 or a few seconds more or less than that.

An example confusion: “man 2 time” on my Linux box claims that time_t is the number of seconds since Epoch, but “date –date=2013-01-01 +%s” ends in “00” so clearly leap seconds aren’t being accounted for.

That’s twice now that I’ve used ISO 8601: 2013-04-09T06:06:06Z is roughly when I’m writing this paragraph. As a format for representing this times this has obvious benefits and equally obvious drawbacks. The benefit is that it’s pretty clear, and almost human readable (by humans like me). A drawback is that it takes a lot of space: 20 characters as I’ve given it. If you were storing this in a file you could reasonably use the “compact” form: 20130409060606Z, which is 15 characters, or 14 without the ‘Z’. Space misers could encode this in BCD and fit it in 8 bytes, which is no more than a 64-bit time_t.

Another drawback with ISO 8601 is that calculations with times can be a bit trickier. But on the other hand, you’re more likely to get correct answers, and be able to represent all (recent and future) times.

I’ve been working with sea-level and tide data recently and Hilary’s floating point number of days is certainly pretty convenient for that. The calculations are concerned with the average rotation of the earth, and it nicely represents that. Practitioners of the art also seem to like a slightly different time scale which is Julian centuries. Basically the same but scaled by 36525. It’s not unusual for the 0-point to be something like 1900-01-01T00:00:00, or even a midday like 1899-12-31T12:00:00.

What about representing times with sub-second accuracy? When Unix time_t type is an integer type then you’re stuck. ISO 8601 times can be extended with a decimal point and have as many decimal places as you like. So that’s a good representation. Hilary’s floating point number of days is also pretty good, as long as you’re using double precision. If you’re using double precision then even counting Julian centuries is okay, you still get sub-nanosecond precision. JavaScript’s representation is basically a floating point number that is the Unix time_t but in milliseconds not seconds. That too is pretty good.

I think Hilary’s format is actually pretty reasonable. Some calculations (involving counting whole numbers of days between times) are easier, some are tricky (but no trickier than any other format really). All times are representable. There’s some ambiguity about times in the future (that 1200 appointment you make in the year 2020 might actually turn up in your calendar at 12:00:00.5 if there is a leap second, but at least your midnight appointments won’t be on the wrong day). Yes, it’s a litle bit funky that on days with leap seconds the step between consecutive seconds will be different (it will in 1/86359 or 1/86401 instead of 1/86400), but I’m sure you’ll cope.

After a while, you just get used to seeing a new file format, another new date representation. According to the xlrd documentation Excel uses a floating point number of days since 1899-12-30T00:00:00, but since they thought 1900 was a leap year, this is only reliable after 1900-03-01. Unless the Excel file was made on a Mac, in which case the Epoch is 1904-01-01. The TOPEX/Jason data for global mean sea level represents dates as a decimal fraction of years.

The bottom line is: There are all sorts of crazy date formats. Get over it.

I recommend:

1) using a good library.
2) for storage, use ISO 8601
3) for runtime/manipulation use double.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: