classy enumerations

2014-02-17

An enumeration is a term that usually refers to a type consisting of a finite number of distinct members. The members themselves can be tested for equality, but usually their particular value is not important.

Maybe you’re modelling a Sphex wasp and you have a state variable with values NOTNEARHOME, JUSTLEFTHOME. You could represent that with an enumeration.

In C the enum keyword assigns (usually small) integers to constant identifiers. It is problematic, chiefly because the members of the enumeration are really just integers. After enum { Foo };, code like Foo / 5 is valid (note: valid, not sensible).

In Python you could do essentially the same thing:

NOTNEARHOME = 0
JUSTLEFTHOME = 1

if self.state == NOTNEARHOME:
    if 'spider' in self.inventory:
        pass    # head towards home
    else:
        pass    # look for juicy spiders

You do see this style (ast.PyCF_ONLY_AST Note 1), but it has the same problems as enum in C. The values are just integers (so, for example, print self.state will print 0, or 1).

You could use strings (like decimal.ROUND_HALF_EVEN):

NOTNEARHOME = 'notnearhome'
# and so on...

That’s better because now I might have a clue if I print out a value and it’s 'notnearhome', but it’s only a little bit better, because you still might accidentally use the value inappropriately (opt = decimal.ROUND_HALF_EVEN; opt += 'foo').

I have a proposal:

Use a class!

class NOTNEARHOME: pass         # Note 2
class JUSTLEFTHOME: pass

Let’s call this classy enumerations.

Classy enumerations have the advantage that we don’t need to manually assign numbers or strings. Values like Mood.PUZZLED and Mood.CONFUSED (which are actually classes) will be unique, so can be tested using == or is correctly.

With classy enumerations we get an error if we accidentally use them in an expression:

>>> PUZZLED+1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'classobj' and 'int'
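
Here is a minimal sketch of the idea, written for Python 3 (the session above is Python 2, where the type in the message is 'classobj'; Python 3 reports 'type' instead). The describe and misuse helpers and their strings are made up for illustration.

```python
# Each enumeration member is just an empty class.
class NOTNEARHOME: pass
class JUSTLEFTHOME: pass


def describe(state):
    """Map a state to an action (hypothetical helper for illustration)."""
    if state is NOTNEARHOME:
        return 'look for juicy spiders'
    if state is JUSTLEFTHOME:
        return 'fly away from home'
    raise ValueError('unknown state: %r' % (state,))


def misuse(state):
    """Try arithmetic on a member; return the name of the error raised."""
    try:
        state + 1
    except TypeError as err:
        return type(err).__name__
    return None


print(describe(NOTNEARHOME))
print(NOTNEARHOME)            # prints the class, not a bare 0 or 1
print(misuse(NOTNEARHOME))
```

Printing a member shows its class name, which is the debugging clue that the integer version lacks.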

And to wrap up:

class True: pass
class False: pass

This article was inspired by looking at some code by Peter Waller, who seems to have invented the idea of using classes for enumerations.

Note 1

Yes this value matches a value in the C header file. Maybe that has some merit, but it doesn’t make for a very Pythonic interface.

Note 2

The body of a class definition is just a block of code. When that body is just a simple statement, it can share the line with the class declaration. Hence, class NOTNEARHOME: pass is a compact complete class definition. If you’re in a mood for documentation, replace “pass” with a docstring.


Teach everyone to program

2014-02-05

The microcomputer was invented only a generation ago; it is now in hundreds of devices and gadgets in every home.

We are on the brink of a revolution. A revolution as significant as the printing press. It was hundreds of years between the press and universal literacy (still not quite achieved in the UK, our adult literacy rate is 99%), but we now live in a society where so much information is written and so much commerce and social interaction takes place in writing that you are at a severe disadvantage if you cannot read and write. We are not all writers, but we can all write. Poets can move entire Nations with carefully crafted written words, but even if most of us can’t achieve that, we can at least write a note to our milkman asking for 2 more pints on Saturday (Note 1).

Imagine if Gove (Note 2) suggests that we only teach the gifted to write. Only the playwrights, the speechwriters, the journalists, and the poets will write. After all, all the good stuff is written by them anyway. I hope you can see that this would be madness.

I feel the same way about code and programming. There is a poetry to code; the poets of programming write code with concision and precision: the structure writ clear on the fan-folded page. There is already a rich literature of programming. github overflows with the pulp fiction of the professional and amateur hack alike; likely we will find Lovecraftian horrors there too, lurking, ready to turn our minds into a pretzel. But at the moment this culture is the culture of an elite class. Everyone should be able to program. Most people will not be poets. Most people will not be programmers, but that should not stop us from teaching them to program.

In the future code will be woven into the fabric of our society, just as the written word is woven now. We don’t teach people to read and write because it will be helpful to them in their future career. We teach them because it is inconceivable that they can function without basic literacy. It is inconceivable that in the future we will be able to function without basic coding.

This is why I’m excited about teaching kids to program. There are lots of grassroots initiatives to teach programming, like Software Carpentry, Raspberry Pi, and Young Rewired State. Let’s build the future now. Teach everyone to program.

Note 1

I live in a quaint postcode where I can still get milk delivered to my doorstep. And I do so.

Note 2

Hello readers from the future! Gove was responsible for education policy in the United Kingdom (Note 3) for a brief period in the early 21st Century.

Note 3

Hello readers from the more distant future. The United Kingdom was a nation consisting of various bits of islands in an archipelago off the North West coast of Europe.


Women of #movember

2013-11-05

In the interests of brevity the title says “women” but I mean to include all partners of people growing a mo’ this month.

I am growing a moustache to support a local prostate cancer group.

The reality is that growing a moustache is not hard. I just have to not shave a certain part of my face for a while. Most of the hardship is endured by my partner Philippa who has to put up with bristly kisses for a month.

Support is vital for success. Support for survivors of cancer, and support for sufferers of cancer. Support for people raising money.

I’d like to thank all the partners out there that are supporting their men growing a moustache. I would not be able to do this without the support of my family and friends, and I’m sure others feel similarly.

You can give your support here: https://www.justgiving.com/drj11.


Explaining p += q in Python

2013-10-29

If you’re a Python programmer you should know about the augmented assignment statements:


i += 1

This adds one to i. There is a whole host of these augmented operators (-=, *=, /=, %= etc).

Aside: I used to call these assignment operators which is the C terminology, but in Python assignment is a statement, not an expression (yay!): you can’t go while i -= 1 (and this is a Good Thing).

An augmented assignment like i += 1 is often described as being the same as i = i + 1, except that i can be a complicated expression and it will not be evaluated twice.

As Julian Todd pointed out to me (a couple of years ago now), this is not quite right when it comes to lists.

Recall that when p and q are lists, p + q is a fresh list that is neither p nor q:


>>> p = [1]
>>> q = [2]
>>> r = p + q
>>> r
[1, 2]
>>> r is p
False
>>> r is q
False

So p += q should be the same as p = p + q, which creates a fresh list and assigns a reference to it to p, right?

No. It’s a little bit tricky to convince yourself of this fact; you have to keep a second reference to the original p (called op below):


>>> p = [1]
>>> op = p
>>> p += [2]
>>> p
[1, 2]
>>> op
[1, 2]
>>> p is op
True

Here it is in pictures:
[Three diagrams: before.dot, fresh.dot, augment.dot]

Because of this, it’s slightly harder to explain how the += assignment statement behaves. For numbers we can explain it by breaking it down into a + operator and an assignment, but for lists this explanation fails because it doesn’t explain how p (in our p += q example) retains its identity (the curious will have already found out that += is implemented by calling the __iadd__ method of p).
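
To make the __iadd__ point concrete, here is a small sketch in Python 3. Bag is a made-up toy class, not anything from the standard library; it only shows how a type can make + return a fresh object while += mutates and returns the existing one.

```python
class Bag:
    """A toy container (hypothetical) that supports both + and +=."""

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        # p + q: build and return a fresh object; p is untouched.
        return Bag(self.items + other.items)

    def __iadd__(self, other):
        # p += q: mutate p in place, then return it so that the name p
        # is rebound to the very same object.
        self.items.extend(other.items)
        return self


p = Bag([1])
op = p                  # second reference to the original p
fresh = p + Bag([2])
p += Bag([3])
print(fresh is op, fresh.items)   # a fresh object
print(p is op, op.items)          # the same object, mutated
```

If a class defines no __iadd__, Python falls back to __add__ and rebinds the name to the result, which is why numbers and tuples behave the "fresh" way.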

What about tuples?

When p and q are tuples the behaviour of += is more like numbers than lists. A fresh tuple is created. It has to be, since you can’t mutate a tuple.

This kind of issue, the difference between creating a fresh object and mutating an existing one, lies at the heart of understanding the P languages (perl, Python, PHP, Ruby, JavaScript).

The keen may wish to fill in this table:

p      q      p + q  p += q
list   list   fresh  p mutated
tuple  tuple  fresh  fresh
list   tuple  ?      ?
tuple  list   ?      ?
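
One way to fill in the table is to ask Python. The following is a rough sketch for Python 3; probe is a made-up helper, and the labels it returns are just the words used in the table plus 'TypeError' for combinations that are refused.

```python
import copy


def probe(p, q):
    """Report what p + q and p += q do: 'fresh', 'mutated' or 'TypeError'."""
    results = []

    # p + q
    try:
        r = p + q
        results.append('fresh' if r is not p and r is not q else 'same object')
    except TypeError:
        results.append('TypeError')

    # p += q, tried on a copy so the caller's p is left alone
    p = copy.copy(p)
    original = p
    try:
        p += q
        results.append('mutated' if p is original else 'fresh')
    except TypeError:
        results.append('TypeError')

    return tuple(results)


print(probe([1], [2]))      # the list/list row
print(probe((1,), (2,)))    # the tuple/tuple row
print(probe([1], (2,)))     # run it to fill in the first ? row
print(probe((1,), [2]))     # and the second
```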

Sunrise

2013-10-16

One of the benefits of getting the train from Sheffield to Liverpool is the beautiful views of the Peak District that I don’t pay enough attention to.

Winter is coming; consequently, it’s just the right time to see the sunrise:

copper bronze dashed across the sky
blots and streaks of morning’s blue
a speck of an aeroplane drifted lazily
the buildings of flats and industry were silhouetted
brick chimneys thrust industrially upward
we arrived in Manchester.


On compiling 34 year old C code

2013-09-01

The 34 year old C code is the C code for ed and sed from Unix Version 7. I’ve been getting it to compile on a modern POSIXish system.

Some changes had to be made. But not very many.

The union hack

sed uses a union to save space in a struct. Specifically, the reptr union has two sub structs that differ only in that one of them has a char *re1 field where the other has a union reptr *lb1. In the old days it was possible to access members of structs inside unions without having to name the intermediate struct. For example the code in the sed implementation uses rep->ad1 instead of rep->reptr1.ad1. That’s no longer possible (I’m pretty sure this shortcut was already out of fashion by the time K&R was published in 1978, but I don’t have a copy to hand).

I first changed the union to a struct that had a union inside it only for the two members that differed:

struct	reptr {
	char	*ad1;
	char	*ad2;
	union {
		char	*re1;
		struct reptr	*lb1;
	} u;
	char	*rhs;
	FILE	*fcode;
	char	command;
	char	gfl;
	char	pfl;
	char	inar;
	char	negfl;
} ptrspace[PTRSIZE], *rep;

That meant changing a few “union reptr” to “struct reptr”, but most of the member accesses would be unchanged. ->re1 had to be changed to ->u.re1, but that’s a simple search and replace.

It wasn’t until I was explaining this ghastly use of union to Paul a day later that I realised the union is a completely unnecessary space-saving micro-optimisation. We can just have a plain struct where only one of the two fields re1 and lb1 is ever used. That’s much nicer, and so is the code.

The rise of headers

In K&R C if the function you were calling returned int then you didn’t need to declare it before calling it. Many functions that in modern times return void, used to return int (which is to say, they didn’t declare what they returned, so it defaulted to int, and if the function used plain return; then that was okay as long as the caller didn’t use the return value). exit() is such a function. sed calls it without declaring it first, and that generates a warning:

sed0.c:48:3: warning: incompatible implicit declaration of built-in function ‘exit’ [enabled by default]

I could declare exit() explicitly, but it seemed simpler to just add #include <stdlib.h>. And it is.

ed declares some of the library functions it uses explicitly. Like malloc():

char	*malloc();

That’s a problem because the declaration of malloc() has changed. Now it’s void *malloc(size_t). This is a fatal violation of the C standard that the compiler is not obliged to warn me about, but thankfully it does.

The modern fix is again to add header files. Amusingly, version 7 Unix didn’t even have a header file that declared malloc().

When it comes to mktemp() (which is also declared explicitly rather than via a header file), ed has a problem:

tfname = mktemp("/tmp/eXXXXX");

2 problems in fact. One is that modern mktemp() expects its argument to have 6 “X”s, not 5. But the more serious problem is that the storage pointed to by the argument will be modified, and the string literal that is passed is not guaranteed to be placed in a modifiable data region. I’m surprised this worked in Version 7 Unix. These days not only is it not guaranteed to work, it doesn’t actually work. SEGFAULT because mktemp() tries to write into a read-only page.

And the 3rd problem is of course that mktemp() is a problematic interface so the better mkstemp() interface made it into the POSIX standard and mktemp() did not.

Which brings me to…

Unfashionable interfaces

ed uses gtty() and stty() to get and set the terminal flags (to switch off ECHO while the encryption key is read from the terminal). Not only is gtty() unfashionable in modern Unixes (I replaced it with tcgetattr()), it was unfashionable in Version 7 Unix. It’s not documented in the man pages; instead, tty(4) documents the use of the TIOCGETP command to ioctl().

ed is already using a legacy interface in 1979.


CoffeeScript is 4 times shorter than C

2013-08-30

Compare FreeBSD’s sed.c (implemented in C) with drj’s sed.js (implemented in CoffeeScript).

A size comparison

              lines bytes
C             2471  62k
CoffeeScript  533   14k

That right there is CoffeeScript’s chief advantage over C. The CoffeeScript is 4 times smaller, meaning CoffeeScript programs are likely to be faster to write, easier to maintain, and contain fewer bugs.

Is it an apples to apples comparison? The FreeBSD sed implements a couple more options than my mostly POSIX compliant sed.js but they don’t pad it out much. Neither implementation includes code to handle Regular Expressions. sed.c uses the Unix library and sed.coffee uses JavaScript’s built-in RegExp class. So I think it’s a pretty reasonable comparison.

What makes the C code bigger? There is a lot of memory management, and an awful lot of string copying and substring extraction. There is an implementation of a hash table (for labels).

Obviously we all knew that high level languages resulted in smaller programs, but it’s rare to get the opportunity to do such a direct comparison.


sed, POSIX, and Node.js

2013-07-11

I’ve been implementing sed. A POSIX compatible sed in Node.js.

It just seemed to me that one day soon the world will need a suite of Unix utilities written in Node.js. And I shall be The One.

The experience has made me a bit sad about the POSIX spec. There are problems. For example, it’s not very good at documenting the actual or desired behaviour of classic Unix utilities:

sed has a D command.

This command deletes the initial portion of the pattern space up to the first newline (which may be the entire pattern space if a newline has not been introduced with an editing command or with N); then D begins a new cycle. At the start of this new cycle, the next line of input is loaded into the pattern space, but ONLY IF THE PATTERN SPACE IS EMPTY.

This last bit is missing from the 2004 edition of the POSIX spec. It’s fixed and documented correctly in the 2013 edition of the POSIX spec.
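
The D behaviour described above can be sketched in a few lines of Python. This is only an illustration of that description, not code from any sed implementation; the function name sed_D and its return convention are invented here.

```python
def sed_D(pattern_space):
    """Apply D to the pattern space.

    Returns (new_pattern_space, read_next_line), where read_next_line
    says whether the new cycle should load a line of input.
    """
    head, newline, rest = pattern_space.partition('\n')
    # With no newline, the whole pattern space is deleted.
    remaining = rest if newline else ''
    # Input is only loaded when nothing is left over.
    return remaining, remaining == ''


print(sed_D('one'))         # nothing left, so the next line is read
print(sed_D('one\ntwo'))    # 'two' is left, so no input is read
```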

The behaviour of sed hasn’t changed since Version 7 in 1979. The D command has always skipped appending input (if there was anything left in the pattern space). Probably no sed ever had its D command behave in the way documented in the 2004 POSIX spec. Maybe if someone was to try building a version of sed from scratch using the 2004 POSIX spec and without reference to any other sed implementations. But who would be mad enough to do that?

At some point someone drafting the POSIX spec didn’t notice the actual behaviour of sed, made a mistake in documenting the behaviour of its D command, and no one noticed until 2013 (well, a few years before, presumably). Which brings me to…

The pace of change is glacial.

Another thing about the POSIX spec which saddened me a little was the way all sorts of bizarre, obscure, and not very useful behaviours get documented and locked-in. You knew that sed has a ! modifier that negates an address. So «sed -n /barf/!p» prints all the lines that do NOT match /barf/. Did you know you can have as many ! as you like? «sed -n /barf/!!!!p» has the same behaviour as the previous program. At least according to the spec, and the Version 7 sed that I tried. There’s no point to this. No real program relies on this behaviour, yet there it is in the spec, so you have to implement it (if you want to comply with the spec). GNU sed (popular on Linux) gives an error instead. Which brings me to…

You can’t really rely on what you read in the spec being implemented.

or…

GNU feel free to depart from the spec whenever they see fit to do so.

sed is a bit weak. For example, its regular expressions (POSIX Basic Regular Expressions) don’t even support «|» for alternation. POSIX has Extended Regular Expressions. Wouldn’t it be sensible to move towards adding Extended Regular Expressions to all the tools that only had Basic Regular Expressions? Well, maybe yes, but there seems to be no taste for doing that in a POSIX committee. And remember…

The pace of change is glacial.


Let me start in a way that i

2013-06-07

Let me start in a way that i hope you find controversial.

I am against gay marriage. The state should not recognise the marriages of gay couples.

But fear not, I’m a nice liberal Guardian-reading libdem-labour-green swing voter. I am not about to join the UKIP. To expand upon my position:

I am against marriage. The state should not recognise the marriages of couples.

Marriage is a contract between 2 people (and god, if you choose to believe). I don’t see why the state needs to get involved.

There are various rights and responsibilities that come with marriage, but in all cases i think it would be better if these were separated:

- married fathers have more rights (for example, direction of medical care) over their children than unmarried ones. That’s just silly.

- married couples have various tax advantages (rarely, disadvantages) depending on the whims of the current government. It really depends what the policy is here. Is it to encourage people to form lasting bonds, or is it to recognise that people who stay at home to manage a household deserve a tax break to recognise their economic contributions and compensate for their lack of earnings? Two sisters share a home where one has a full-time job and the other looks after the house and does a few hours teaching a week (maybe they co-parent a child). Should they not be able to pool their tax-free allowance? People’s lives are entangled in all sorts of ways that the “married couple” paradigm does not recognise.

- married couples have access to the divorce courts. Unmarried couples would benefit from the guidance and impartiality of the divorce courts too. Separation is a stressful, messy, confusing thing to go through, and of course the divorce court does not wave a magic wand and make it all unicorns and ponies, but i think it is helpful, and it would be helpful to all couples.

- the estate is presumed to pass to the surviving widow when one of a married couple dies.

It’s not that i don’t think the state should be doing these things, in most cases i think they should. It’s that i don’t think these things should be bundled up in Marriage. Sharing your tax-free allowance should be a matter of filing a form with the tax office. Letting your estate pass to another on your death should again, just be a matter of filing a form.

Asserting paternal rights over a child is a lot more messy. Should it matter if you’re not the genetic father of a child, if you’re the one providing the home and investing in the child’s development? Should the mother be allowed to prevent access to a supposed father? I have no idea, but whether a couple is married or not should not matter.

Public recognition. Of course people should be allowed to marry and be married. Public recognition is one of the most important aspects of marriage. But this recognition comes from your peers, not from the state. Couples have been getting married for thousands of years, long before governments came along to record the marriages.

Marriage as a union for life between a man and a woman is a christian thing. Marriage as actually practiced over the last 5000 years or so is much more diverse. We should embrace that diversity (once again). Capturing this in legislation would be a complicated nightmare.

The simplest thing the state can do is not get involved.


Date formats

2013-04-12

Hilary Mason on twitter bemoaned the fact that matplotlib appears to use a floating point number of days to represent datetimes and suggested that “any other standard format” would be better. It is a little bit odd that a Python library uses that format, but it’s presumably because matlab does, and matplotlib is betraying its matlab heritage.

What other standards are there? Well, there’s POSIX time_t which is either an integer or a floating-point type (but actual practice seems to favour integer), and stores the number of seconds since the Unix Epoch, 1970-01-01T00:00:00Z. Not counting leap seconds. Not counting leap seconds is convenient for some calculations, but means there are (recent) times that cannot be represented (namely, any moment during a leap second). That’s not a good representation of time.

There’s another time format in use which is confusingly similar to POSIX’s time_t: it counts time using the number of seconds since the Unix Epoch. Including leap seconds. The opportunities for mistakes in conversion are endless. A problem with this time format is that times in the future (more than a couple of years into the future) are ambiguous. Since we don’t yet know if there are going to be leap seconds between now and 2020 we don’t know whether 2020-01-01T00:00:00Z is 1577836800 or a few seconds more or less than that.

An example confusion: “man 2 time” on my Linux box claims that time_t is the number of seconds since Epoch, but “date --date=2013-01-01 +%s” ends in “00” so clearly leap seconds aren’t being accounted for.
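
The same check can be made in Python 3, whose datetime arithmetic also ignores leap seconds, so every day counts as exactly 86400 seconds. A small sketch (posix_seconds is a made-up helper name):

```python
from datetime import datetime, timezone


def posix_seconds(year, month, day):
    """POSIX-style seconds since the Unix Epoch for midnight UTC."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


t2013 = posix_seconds(2013, 1, 1)
print(t2013, t2013 % 86400)         # a whole number of 86400-second days
print(posix_seconds(2020, 1, 1))    # 1577836800
```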

That’s twice now that I’ve used ISO 8601: 2013-04-09T06:06:06Z is roughly when I’m writing this paragraph. As a format for representing times this has obvious benefits and equally obvious drawbacks. The benefit is that it’s pretty clear, and almost human readable (by humans like me). A drawback is that it takes a lot of space: 20 characters as I’ve given it. If you were storing this in a file you could reasonably use the “compact” form: 20130409060606Z, which is 15 characters, or 14 without the ‘Z’. Space misers could encode this in BCD and fit it in 8 bytes, which is no more than a 64-bit time_t.
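
For the record, here is a sketch of those three encodings in Python 3. The BCD packing shown (two decimal digits per byte, via bytes.fromhex) is just one plausible way to do it, not a standard.

```python
from datetime import datetime

t = datetime(2013, 4, 9, 6, 6, 6)           # taken to be UTC

extended = t.strftime('%Y-%m-%dT%H:%M:%SZ')
compact = t.strftime('%Y%m%d%H%M%SZ')
# BCD: two decimal digits per byte, dropping the trailing 'Z'.
bcd = bytes.fromhex(compact[:-1])

print(extended, len(extended))    # 20 characters
print(compact, len(compact))      # 15 characters
print(bcd.hex(), len(bcd))        # 7 bytes, so it fits in 8
```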

Another drawback with ISO 8601 is that calculations with times can be a bit trickier. But on the other hand, you’re more likely to get correct answers, and be able to represent all (recent and future) times.

I’ve been working with sea-level and tide data recently and Hilary’s floating point number of days is certainly pretty convenient for that. The calculations are concerned with the average rotation of the earth, and it nicely represents that. Practitioners of the art also seem to like a slightly different time scale which is Julian centuries. Basically the same but scaled by 36525. It’s not unusual for the 0-point to be something like 1900-01-01T00:00:00, or even a midday like 1899-12-31T12:00:00.
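
As a rough illustration of those two time scales, here is a Python 3 sketch that takes 1900-01-01T00:00:00 as the 0-point (one of the choices mentioned above). The function names are invented for the example.

```python
from datetime import datetime

EPOCH = datetime(1900, 1, 1)    # one of the 0-points mentioned above


def days_since_epoch(t):
    """Floating point number of days since EPOCH."""
    return (t - EPOCH).total_seconds() / 86400.0


def julian_centuries(t):
    """The same time scale divided by 36525 days."""
    return days_since_epoch(t) / 36525.0


noon = datetime(2000, 1, 1, 12, 0, 0)
print(days_since_epoch(noon))     # 36524.5
print(julian_centuries(noon))     # just under 1.0
```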

What about representing times with sub-second accuracy? When Unix time_t type is an integer type then you’re stuck. ISO 8601 times can be extended with a decimal point and have as many decimal places as you like. So that’s a good representation. Hilary’s floating point number of days is also pretty good, as long as you’re using double precision. If you’re using double precision then even counting Julian centuries is okay, you still get roughly microsecond precision. JavaScript’s representation is basically a floating point number that is the Unix time_t but in milliseconds not seconds. That too is pretty good.
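
A rough check of how fine-grained a double is at these magnitudes, in Python 3. The two sample values are assumptions chosen for illustration: a day count in the hundreds of thousands, and a century count a little over 1.

```python
import math


def spacing(x):
    """Gap between x and the next larger double (for normal, positive x)."""
    mantissa, exponent = math.frexp(x)
    return math.ldexp(1.0, exponent - 53)


SECONDS_PER_DAY = 86400.0
SECONDS_PER_CENTURY = 36525 * SECONDS_PER_DAY

days = 735000.25     # an illustrative day count (assumed magnitude)
centuries = 1.13     # roughly 2013 counted from 1900

print(spacing(days) * SECONDS_PER_DAY)            # about 1e-5 seconds
print(spacing(centuries) * SECONDS_PER_CENTURY)   # about 7e-7 seconds
```

The resolution depends on the size of the number being stored, so a 0-point close to the times of interest gives finer steps.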

I think Hilary’s format is actually pretty reasonable. Some calculations (involving counting whole numbers of days between times) are easier, some are tricky (but no trickier than any other format really). All times are representable. There’s some ambiguity about times in the future (that 1200 appointment you make in the year 2020 might actually turn up in your calendar at 12:00:00.5 if there is a leap second, but at least your midnight appointments won’t be on the wrong day). Yes, it’s a little bit funky that on days with leap seconds the step between consecutive seconds will be different (it will be 1/86399 or 1/86401 instead of 1/86400), but I’m sure you’ll cope.

After a while, you just get used to seeing a new file format, another new date representation. According to the xlrd documentation Excel uses a floating point number of days since 1899-12-30T00:00:00, but since they thought 1900 was a leap year, this is only reliable after 1900-03-01. Unless the Excel file was made on a Mac, in which case the Epoch is 1904-01-01. The TOPEX/Jason data for global mean sea level represents dates as a decimal fraction of years.
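
Going by the description above, converting an Excel day number is a matter of adding it to the right epoch. This Python 3 sketch uses only the standard library; excel_to_datetime is a made-up name and is not xlrd's own API.

```python
from datetime import datetime, timedelta

EXCEL_EPOCH_1900 = datetime(1899, 12, 30)   # per the description above
EXCEL_EPOCH_1904 = datetime(1904, 1, 1)     # files made on a Mac


def excel_to_datetime(serial, mac=False):
    """Convert an Excel day number to a datetime.

    For the 1900 system this is only trustworthy from 1900-03-01 onwards.
    """
    epoch = EXCEL_EPOCH_1904 if mac else EXCEL_EPOCH_1900
    return epoch + timedelta(days=serial)


print(excel_to_datetime(41373.5))             # 2013-04-09 12:00:00
print(excel_to_datetime(39911.5, mac=True))   # the same moment, Mac epoch
```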

The bottom line is: There are all sorts of crazy date formats. Get over it.

I recommend:

1) using a good library.
2) for storage, use ISO 8601
3) for runtime/manipulation use double.

