Women of #movember

2013-11-05

In the interests of brevity the title says “women” but I mean to include all partners of people growing a mo’ this month.

I am growing a moustache to support a local prostate cancer group.

The reality is that growing a moustache is not hard. I just have to not shave a certain part of my face for a while. Most of the hardship is endured by my partner Philippa who has to put up with bristly kisses for a month.

Support is vital for success. Support for survivors of cancer, and support for sufferers of cancer. Support for people raising money.

I’d like to thank all the partners out there that are supporting their men growing a moustache. I would not be able to do this without the support of my family and friends, and I’m sure others feel similarly.

You can give your support here: https://www.justgiving.com/drj11.


Explaining p += q in Python

2013-10-29

If you’re a Python programmer you should know about the augmented assignment statements:


i += 1

This adds one to i. There is a whole host of these augmented operators (-=, *=, /=, %= etc).

Aside: I used to call these assignment operators which is the C terminology, but in Python assignment is a statement, not an expression (yay!): you can’t go while i -= 1 (and this is a Good Thing).

An augmented assignment like i += 1 is often described as being the same as i = i + 1, except that i can be a complicated expression and it will not be evaluated twice.
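A small sketch of the "not evaluated twice" part. The helper index() and the names calls and a are made up for illustration; the helper just records each time it is called:

```python
calls = []

def index():
    """Hypothetical helper: return a subscript and record that it ran."""
    calls.append("index")
    return 0

a = [10]

a[index()] += 1               # the subscript expression runs once
once = len(calls)

calls.clear()
a[index()] = a[index()] + 1   # the spelled-out form runs it twice
twice = len(calls)

print(once, twice, a)
```

Here once is 1 and twice is 2, and a ends up as [12] either way; only the number of evaluations differs.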

As Julian Todd pointed out to me (a couple of years ago now), this is not quite right when it comes to lists.

Recall that when p and q are lists, p + q is a fresh list that is neither p nor q:


>>> p = [1]
>>> q = [2]
>>> r = p + q
>>> r
[1, 2]
>>> r is p
False
>>> r is q
False

So p += q should be the same as p = p + q, which creates a fresh list and assigns a reference to it to p, right?

No. It’s a little bit tricky to convince yourself of this fact; you have to keep a second reference to the original p (called op below):


>>> p = [1]
>>> op = p
>>> p += [2]
>>> p
[1, 2]
>>> op
[1, 2]
>>> p is op
True

Here it is in pictures:
[Diagrams: before.dot, fresh.dot, augment.dot]

Because of this, it’s slightly harder to explain how the += assignment statement behaves. For numbers we can explain it by breaking it down into a + operator and an assignment, but for lists this explanation fails because it doesn’t explain how p (in our p += q example) retains its identity (the curious will have already found out that += is implemented by calling the __iadd__ method of p).
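A rough sketch of that dispatch, as I understand it: try __iadd__ on the type of p, and fall back to plain + when it is missing or declines. The function name augmented_add is hypothetical, and this is a simplification rather than the interpreter's actual code (the standard library's operator.iadd does the real thing):

```python
def augmented_add(p, q):
    """Approximate what `p += q` computes before the result is bound to p."""
    iadd = getattr(type(p), "__iadd__", None)
    if iadd is not None:
        result = iadd(p, q)
        if result is not NotImplemented:
            return result      # lists: mutated in place, p itself comes back
    return p + q               # numbers, tuples: fall back to a fresh object

p = [1]
op = p
p = augmented_add(p, [2])

t = (1,)
ot = t
t = augmented_add(t, (2,))

print(p is op, t is ot)
```

The list keeps its identity (p is op) while the tuple does not, which matches the sessions above.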

What about tuples?

When p and q are tuples the behaviour of += is more like numbers than lists. A fresh tuple is created. It has to be, since you can’t mutate a tuple.

This kind of issue, the difference between creating a fresh object and mutating an existing one, lies at the heart of understanding the P languages (Perl, Python, PHP, Ruby, JavaScript).

The keen may wish to fill in this table:

p      q      p + q  p += q
list   list   fresh  p mutated
tuple  tuple  fresh  fresh
list   tuple  ?      ?
tuple  list   ?      ?
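One way to fill in the table is to let Python answer. This is only a sketch; the function name try_ops and the labels it returns are my own invention, chosen to match the wording of the table:

```python
def try_ops(p, q):
    """Report what p + q and p += q do for one row of the table."""
    original = p
    try:
        plus = "fresh" if (p + q) is not original else "same object"
    except TypeError:
        plus = "TypeError"
    try:
        p += q
        augmented = "p mutated" if p is original else "fresh"
    except TypeError:
        augmented = "TypeError"
    return plus, augmented

for p, q in [([1], [2]), ((1,), (2,)), ([1], (2,)), ((1,), [2])]:
    print(type(p).__name__, type(q).__name__, *try_ops(p, q))
```

The operands are deliberately non-empty, since implementations are free to hand back an existing object when nothing needs to be copied.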

Sunrise

2013-10-16

One of the benefits of getting the train from Sheffield to Liverpool is the beautiful views of the Peak District that I don’t pay enough attention to.

Winter is coming; consequently, it’s just the right time to see the sunrise:

copper bronze dashed across the sky
blots and streaks of morning’s blue
a speck of an aeroplane drifted lazily
the buildings of flats and industry were silhouetted
brick chimneys thrust industrially upward
we arrived in Manchester.


On compiling 34 year old C code

2013-09-01

The 34 year old C code is the C code for ed and sed from Unix Version 7. I’ve been getting it to compile on a modern POSIXish system.

Some changes had to be made. But not very many.

The union hack

sed uses a union to save space in a struct. Specifically, the reptr union has two sub structs that differ only in that one of them has a char *re1 field where the other has a union reptr *lb1. In the old days it was possible to access members of structs inside unions without having to name the intermediate struct. For example the code in the sed implementation uses rep->ad1 instead of rep->reptr1.ad1. That’s no longer possible (I’m pretty sure this shortcut was already out of fashion by the time K&R was published in 1978, but I don’t have a copy to hand).

I first changed the union to a struct that had a union inside it only for the two members that differed:

struct	reptr {
	char	*ad1;
	char	*ad2;
	union {
		char	*re1;
		struct reptr	*lb1;
	} u;
	char	*rhs;
	FILE	*fcode;
	char	command;
	char	gfl;
	char	pfl;
	char	inar;
	char	negfl;
} ptrspace[PTRSIZE], *rep;

That meant changing a few “union reptr” to “struct reptr”, but most of the member accesses would be unchanged. ->re1 had to be changed to ->u.re1, but that’s a simple search and replace.

It wasn’t until I was explaining this ghastly use of union to Paul a day later that I realised the union is a completely unnecessary space-saving micro-optimisation. We can just have a plain struct where only one of the two fields re1 and lb1 is ever used. That’s much nicer, and so is the code.

The rise of headers

In K&R C if the function you were calling returned int then you didn’t need to declare it before calling it. Many functions that in modern times return void, used to return int (which is to say, they didn’t declare what they returned, so it defaulted to int, and if the function used plain return; then that was okay as long as the caller didn’t use the return value). exit() is such a function. sed calls it without declaring it first, and that generates a warning:

sed0.c:48:3: warning: incompatible implicit declaration of built-in function ‘exit’ [enabled by default]

I could declare exit() explicitly, but it seemed simpler to just add #include <stdlib.h>. And it is.

ed declares some of the library functions it uses explicitly. Like malloc():

char	*malloc();

That’s a problem because the declaration of malloc() has changed. Now it’s void *malloc(size_t). This is a fatal violation of the C standard that the compiler is not obliged to warn me about, but thankfully it does.

The modern fix is again to add header files. Amusingly, version 7 Unix didn’t even have a header file that declared malloc().

When it comes to mktemp() (which is also declared explicitly rather than via a header file), ed has a problem:

tfname = mktemp("/tmp/eXXXXX");

2 problems in fact. One is that modern mktemp() expects its argument to have 6 “X”s, not 5. But the more serious problem is that the storage pointed to by the argument will be modified, and the string literal that is passed is not guaranteed to be placed in a modifiable data region. I’m surprised this worked in Version 7 Unix. These days not only is it not guaranteed to work, it doesn’t actually work. SEGFAULT because mktemp() tries to write into a read-only page.

And the 3rd problem is of course that mktemp() is a problematic interface so the better mkstemp() interface made it into the POSIX standard and mktemp() did not.
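For comparison, Python's tempfile module has functions with the same two names and the same trade-off: mktemp() only picks a name (leaving a gap before the file is created), while mkstemp() creates and opens the file in one step. A minimal sketch, not a port of ed's code; the function name scratch_file and the prefix "e" (echoing /tmp/eXXXXX) are my assumptions:

```python
import os
import tempfile

def scratch_file():
    """Create, use and remove a temporary file without a naming race."""
    # mkstemp() returns (file descriptor, path) for a file that
    # already exists and is already open.
    fd, path = tempfile.mkstemp(prefix="e")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("scratch data\n")
        existed = os.path.exists(path)
    finally:
        os.remove(path)
    return path, existed

path, existed = scratch_file()
print(os.path.basename(path), existed)
```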

Which brings me to…

Unfashionable interfaces

ed uses gtty() and stty() to get and set the terminal flags (to switch off ECHO while the encryption key is read from the terminal). Not only is gtty() unfashionable in modern Unixes (I replaced it with tcgetattr()), it was unfashionable in Version 7 Unix. It’s not documented in the man pages; instead, tty(4) documents the use of the TIOCGETP command to ioctl().

ed is already using a legacy interface in 1979.


CoffeeScript is 4 times shorter than C

2013-08-30

Compare FreeBSD’s sed.c (implemented in C) with drj’s sed.js (implemented in CoffeeScript).

A size comparison

              lines bytes
C             2471  62k
CoffeeScript  533   14k

That right there is CoffeeScript’s chief advantage over C. The CoffeeScript is 4 times smaller, meaning CoffeeScript programs are likely to be faster to write, easier to maintain, and contain fewer bugs.

Is it an apples to apples comparison? The FreeBSD sed implements a couple more options than my mostly POSIX compliant sed.js but they don’t pad it out much. Neither implementation includes code to handle Regular Expressions. sed.c uses the Unix library and sed.coffee uses JavaScript’s built-in RegExp class. So I think it’s a pretty reasonable comparison.

What makes the C code bigger? There is a lot of memory management, and an awful lot of string copying and substring extraction. There is an implementation of a hash table (for labels).

Obviously we all knew that high level languages resulted in smaller programs, but it’s rare to get the opportunity to do such a direct comparison.


sed, POSIX, and Node.js

2013-07-11

I’ve been implementing sed. A POSIX compatible sed in Node.js.

It just seemed to me that one day soon the world will need a suite of Unix utilities written in Node.js. And I shall be The One.

The experience has made me a bit sad about the POSIX spec. There are problems. For example, it’s not very good at documenting the actual or desired behaviour of classic Unix utilities:

sed has a D command.

This command deletes the initial portion of the pattern space up to the first newline (which may be the entire pattern space if a newline has not been introduced with an editing command or with N); then D begins a new cycle. At the start of this new cycle, the next line of input is loaded into the pattern space, but ONLY IF THE PATTERN SPACE IS EMPTY.

This last bit is missing from the 2004 edition of the POSIX spec. It’s fixed and documented correctly in the 2013 edition of the POSIX spec.

The behaviour of sed hasn’t changed since Version 7 in 1979. The D command has always skipped appending input (if there was anything left in the pattern space). Probably no sed ever had its D command behave in the way documented in the 2004 POSIX spec. Maybe if someone was to try building a version of sed from scratch using the 2004 POSIX spec and without reference to any other sed implementations. But who would be mad enough to do that?
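To see why the missing clause matters, here is a toy emulation of the classic script «$!N;P;D» under both readings. It is a sketch of that one script only, not of sed.js or any real sed; the function name sliding_window and the always_reload flag are made up for illustration:

```python
def sliding_window(lines, always_reload):
    """Emulate the sed script '$!N;P;D' over a list of input lines.

    always_reload=False: D skips loading input when text is left over.
    always_reload=True:  every new cycle loads the next input line.
    """
    pending = list(lines)   # input not yet read
    pattern = ""            # the pattern space
    keep = False            # True when D left text for the next cycle
    out = []
    while True:
        # Start of cycle: load the next input line (or stop at end of input).
        if always_reload or not keep:
            if not pending:
                break
            pattern = pending.pop(0)
        # $!N : unless this is the last line, append the next line.
        if pending:
            pattern += "\n" + pending.pop(0)
        # P : print up to the first newline.
        head, newline, rest = pattern.partition("\n")
        out.append(head)
        # D : delete up to the first newline and begin a new cycle.
        pattern, keep = rest, bool(newline)
    return out

print(sliding_window(["a", "b", "c"], always_reload=False))
print(sliding_window(["a", "b", "c"], always_reload=True))
```

With the real behaviour every line is printed once (a, b, c). With the always-reload reading, the leftover "b" is overwritten by the next input line and the output is just a, c.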

At some point someone drafting the POSIX spec didn’t notice the actual behaviour of sed, made a mistake in documenting the behaviour of its D command, and no one noticed until 2013 (well, a few years before, presumably). Which brings me to…

The pace of change is glacial.

Another thing about the POSIX spec which saddened me a little was the way all sorts of bizarre, obscure, and not very useful behaviours get documented and locked-in. You knew that sed has a ! modifier that negates an address. So «sed -n /barf/!p» prints all the lines that do NOT match /barf/. Did you know you can have as many ! as you like? «sed -n /barf/!!!!p» has the same behaviour as the previous program. At least according to the spec, and the Version 7 sed that I tried. There’s no point to this. No real program relies on this behaviour, yet there it is in the spec, so you have to implement it (if you want to comply with the spec). GNU sed (popular on Linux) gives an error instead. Which brings me to…

You can’t really rely on what you read in the spec being implemented.

or…

GNU feel free to depart from the spec whenever they see fit to do so.

sed is a bit weak. For example, its regular expressions (POSIX Basic Regular Expressions) don’t even support «|» for alternation. POSIX has Extended Regular Expressions. Wouldn’t it be sensible to move towards adding Extended Regular Expressions to all the tools that only had Basic Regular Expressions? Well, maybe yes, but there seems to be no taste for doing that in a POSIX committee. And remember…

The pace of change is glacial.


Let me start in a way that i

2013-06-07

Let me start in a way that i hope you find controversial.

I am against gay marriage. The state should not recognise the marriages of gay couples.

But fear not, I’m a nice liberal Guardian-reading libdem-labour-green swing voter. I am not about to join the UKIP. To expand upon my position:

I am against marriage. The state should not recognise the marriages of couples.

Marriage is a contract between 2 people (and god, if you choose to believe). I don’t see why the state needs to get involved.

There are various rights and responsibilities that come with marriage, but in all cases i think it would be better if these were separated:

- married fathers have more rights (for example, direction of medical care) over their children than unmarried ones. That’s just silly.

- married couples have various tax advantages (rarely, disadvantages) depending on the whims of the current government. It really depends what the policy is here. Is it to encourage people to form lasting bonds, or is it to recognise that people who stay at home to manage a household deserve a tax break to recognise their economic contributions and compensate for their lack of earnings? Two sisters share a home where one has a full-time job and the other looks after the house and does a few hours teaching a week (maybe they co-parent a child). Should they not be able to pool their tax-free allowance? People’s lives are entangled in all sorts of ways that the “married couple” paradigm does not recognise.

- married couples have access to the divorce courts. Unmarried couples would benefit from the guidance and impartiality of the divorce courts too. Separation is a stressful, messy, confusing thing to go through, and of course the divorce court does not wave a magic wand and make it all unicorns and ponies, but i think it is helpful, and it would be helpful to all couples.

- the estate is presumed to pass to the surviving widow when one of a married couple dies.

It’s not that i don’t think the state should be doing these things, in most cases i think they should. It’s that i don’t think these things should be bundled up in Marriage. Sharing your tax-free allowance should be a matter of filing a form with the tax office. Letting your estate pass to another on your death should again, just be a matter of filing a form.

Asserting paternal rights over a child is a lot more messy. Should it matter if you’re not the genetic father of a child, if you’re the one providing the home and investing in the child’s development? Should the mother be allowed to prevent access to a supposed father? I have no idea, but whether a couple is married or not should not matter.

Public recognition. Of course people should be allowed to marry and be married. Public recognition is one of the most important aspects of marriage. But this recognition comes from your peers, not from the state. Couples have been getting married for thousands of years, long before governments came along to record the marriages.

Marriage as a union for life between a man and a woman is a christian thing. Marriage as actually practiced over the last 5000 years or so is much more diverse. We should embrace that diversity (once again). Capturing this in legislation would be a complicated nightmare.

The simplest thing the state can do is not get involved.


Date formats

2013-04-12

Hilary Mason on twitter bemoaned the fact that matplotlib appears to use a floating point number of days to represent datetimes and suggested that “any other standard format” would be better. It is a little bit odd that a Python library uses that format, but it’s presumably because matlab does, and matplotlib is betraying its matlab heritage.

What other standards are there? Well, there’s POSIX time_t which is either an integer or a floating-point type (but actual practice seems to favour integer), and stores the number of seconds since the Unix Epoch, 1970-01-01T00:00:00Z. Not counting leap seconds. Not counting leap seconds is convenient for some calculations, but means there are (recent) times that cannot be represented (namely, any moment during a leap second). That’s not a good representation of time.

There’s another time format in use which is confusingly similar to POSIX’s time_t: it counts time using the number of seconds since the Unix Epoch. Including leap seconds. The opportunities for mistakes in conversion are endless. A problem with this time format is that times in the future (more than a couple of years into the future) are ambiguous. Since we don’t yet know if there are going to be leap seconds between now and 2020 we don’t know whether 2020-01-01T00:00:00Z is 1577836800 or a few seconds more or less than that.

An example confusion: “man 2 time” on my Linux box claims that time_t is the number of seconds since Epoch, but “date --date=2013-01-01 +%s” ends in “00” so clearly leap seconds aren’t being accounted for.
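The POSIX arithmetic is easy to reproduce with calendar.timegm() from the Python standard library, which treats every day as exactly 86400 seconds. A small sketch; the wrapper name posix_seconds is mine:

```python
import calendar

def posix_seconds(year, month, day, hour=0, minute=0, second=0):
    """Seconds since the Unix Epoch, POSIX style (every day is 86400 s)."""
    return calendar.timegm((year, month, day, hour, minute, second))

new_year_2013 = posix_seconds(2013, 1, 1)
new_year_2020 = posix_seconds(2020, 1, 1)

# The leap second 2012-06-30T23:59:60Z gets no value of its own:
# it collides with the first second of the following day.
leap = posix_seconds(2012, 6, 30, 23, 59, 60)
after = posix_seconds(2012, 7, 1)

print(new_year_2013, new_year_2020, leap == after)
```

Both midnights are whole multiples of 86400, which is why the output of date ends in “00”.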

That’s twice now that I’ve used ISO 8601: 2013-04-09T06:06:06Z is roughly when I’m writing this paragraph. As a format for representing times this has obvious benefits and equally obvious drawbacks. The benefit is that it’s pretty clear, and almost human readable (by humans like me). A drawback is that it takes a lot of space: 20 characters as I’ve given it. If you were storing this in a file you could reasonably use the “compact” form: 20130409060606Z, which is 15 characters, or 14 without the ‘Z’. Space misers could encode this in BCD and fit it in 8 bytes, which is no more than a 64-bit time_t.
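The sizes are easy to check. For the BCD step this sketch leans on the fact that packed BCD of the digits 0-9 is the same as reading the digit string as hexadecimal; the variable names are my own:

```python
full = "2013-04-09T06:06:06Z"
compact = "20130409060606Z"
digits = compact.rstrip("Z")

# Packed BCD: one decimal digit per 4 bits, two digits per byte.
bcd = bytes.fromhex(digits)

print(len(full), len(compact), len(digits), len(bcd))
```

The 14 digits pack into 7 bytes, so there is even a byte to spare in an 8-byte field.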

Another drawback with ISO 8601 is that calculations with times can be a bit trickier. But on the other hand, you’re more likely to get correct answers, and be able to represent all (recent and future) times.

I’ve been working with sea-level and tide data recently and Hilary’s floating point number of days is certainly pretty convenient for that. The calculations are concerned with the average rotation of the earth, and it nicely represents that. Practitioners of the art also seem to like a slightly different time scale which is Julian centuries. Basically the same but scaled by 36525. It’s not unusual for the 0-point to be something like 1900-01-01T00:00:00, or even a midday like 1899-12-31T12:00:00.

What about representing times with sub-second accuracy? When Unix’s time_t is an integer type then you’re stuck. ISO 8601 times can be extended with a decimal point and have as many decimal places as you like. So that’s a good representation. Hilary’s floating point number of days is also pretty good, as long as you’re using double precision. If you’re using double precision then even counting Julian centuries is okay: you still get sub-microsecond precision. JavaScript’s representation is basically a floating point number that is the Unix time_t but in milliseconds not seconds. That too is pretty good.
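A back-of-envelope check of how fine a step each double precision representation can resolve for a date in 2013. The magnitudes below are assumptions picked for illustration (a day count starting from year 1, Julian centuries since 1900, milliseconds since 1970), and ulp() is a hand-rolled helper:

```python
import math

def ulp(x):
    """Gap between x and the next larger double (normal, positive x)."""
    _, exponent = math.frexp(x)
    return math.ldexp(1.0, exponent - 53)

SECONDS_PER_DAY = 86400
DAYS_PER_CENTURY = 36525

# Assumed magnitudes for a date in 2013; the epochs are illustrative.
days = 735000.0      # day count from year 1
centuries = 1.13     # Julian centuries since 1900
millis = 1.37e12     # JavaScript-style milliseconds since 1970

step_days = ulp(days) * SECONDS_PER_DAY
step_centuries = ulp(centuries) * DAYS_PER_CENTURY * SECONDS_PER_DAY
step_millis = ulp(millis) / 1000

print(step_days, step_centuries, step_millis)
```

On these assumptions the steps come out at roughly 10 microseconds for days, 0.7 microseconds for centuries and 0.25 microseconds for milliseconds. Picking an epoch closer to the dates of interest makes the day count finer.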

I think Hilary’s format is actually pretty reasonable. Some calculations (involving counting whole numbers of days between times) are easier, some are tricky (but no trickier than any other format really). All times are representable. There’s some ambiguity about times in the future (that 1200 appointment you make in the year 2020 might actually turn up in your calendar at 12:00:00.5 if there is a leap second, but at least your midnight appointments won’t be on the wrong day). Yes, it’s a little bit funky that on days with leap seconds the step between consecutive seconds will be different (it will be 1/86399 or 1/86401 instead of 1/86400), but I’m sure you’ll cope.

After a while, you just get used to seeing a new file format, another new date representation. According to the xlrd documentation Excel uses a floating point number of days since 1899-12-30T00:00:00, but since they thought 1900 was a leap year, this is only reliable after 1900-03-01. Unless the Excel file was made on a Mac, in which case the Epoch is 1904-01-01. The TOPEX/Jason data for global mean sea level represents dates as a decimal fraction of years.
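Converting the Excel flavour by hand is a one-liner once you know the epoch. A sketch using only the standard library, with the epochs taken from the paragraph above; the function and constant names are mine, and it ignores the unreliable range before 1900-03-01:

```python
from datetime import datetime, timedelta

EXCEL_EPOCH = datetime(1899, 12, 30)     # the usual epoch, per the xlrd docs
EXCEL_MAC_EPOCH = datetime(1904, 1, 1)   # workbooks made on a Mac

def from_excel(serial, epoch=EXCEL_EPOCH):
    """Convert a floating point day count to a naive datetime."""
    return epoch + timedelta(days=serial)

print(from_excel(41376.5))
```

This prints 2013-04-12 12:00:00. The two epochs are 1462 days apart, so the same moment in a Mac-made file would be stored as 39914.5.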

The bottom line is: There are all sorts of crazy date formats. Get over it.

I recommend:

1) using a good library.
2) for storage, use ISO 8601
3) for runtime/manipulation use double.
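A sketch of what recommendations 2 and 3 might look like together: ISO 8601 strings at rest, a double while computing. The choice of days since the Unix Epoch as the runtime unit is an assumption made for the example, as are the names load and store; leap seconds are ignored, as in POSIX:

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def load(text):
    """Parse a stored ISO 8601 UTC string into days since the Unix Epoch."""
    moment = datetime.strptime(text, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return moment.timestamp() / 86400

def store(days):
    """Format a day count as an ISO 8601 UTC string (whole seconds)."""
    moment = datetime.fromtimestamp(round(days * 86400), tz=timezone.utc)
    return moment.strftime(ISO_FORMAT)

stored = "2013-04-09T06:06:06Z"
days = load(stored)
print(days, store(days), store(days + 1))
```

Rounding to whole seconds in store() absorbs the tiny floating point error from the division, so the round trip gives back the original string.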


All I need to know to learn R

2012-10-05

I’ve been learning R, mostly because it’s been on my list of things to do for ages, and partly because I needed to draw a histogram.

All the tedious stuff about how you get started and how you install things is surprisingly difficult to get from the internets.

So to install on Ubuntu:

sudo apt-get install r-base

Then you run the command R:

$ R

Once you’re in R you can find things out by using Google. Once you’ve found a function you want to use, say sj.test, if it doesn’t seem to be installed, you can install it by noting the library name which is in curly brackets at the top of the man page. {lawstat} in this case. So you then go:

> install.packages("lawstat") # to install it
> library(lawstat) # to use it

(the package installer has a hilariously craptastic interface written in Tcl/Tk)

That’s it. Everything I need to know to learn R. Everything else is just bog standard programming language stuff (though it helps a bit that I learnt a bit of J).

Here’s the histogram:

And the R code is behind this link. The way the function png() implicitly makes hist() write into the PNG file is particularly bletcherous. It has all the elegance of writing JCL for IBM mainframes.


Bosch: the Constructor

2012-05-17

We got some second-hand Duplo recently, and washed it. In Bosch, our washing machine. In a pillow case at 30°C on the delicate program. Makes a funny noise but it all came out fine. This was the first time I’d washed Lego in a machine, so it’s nice to know that the suggestion in the Lego FAQ works.

The curious thing was that the Duplo had SELF ASSEMBLED. It all went in as separate bricks. But when it came out, some of it had stuck together. I found this pretty amazing. Surely if Lego can self assemble in the washing machine then it is a simple step from there to the beginning of life itself.

The variety of forms is interesting. Simple diatoms… in mixed colours:

…and also monochrome:

Simple towers, reminiscent of the things my little nephew makes, when he can be bothered:

Then there come the forms that are not so easy to characterise. I like to call them monsters:

I’m particularly impressed with Bosch’s creative instincts here. The one at the front is a battleship with effective use of colour. The back left reminds me a bit of a tree. Perhaps some sort of highly coloured baobab tree. And is that a red boot kicking a yellow football on a field of green?

Francis speculates (whimsically) that maybe my washing machine has passed the singularity and is in fact an Artificial Intelligence trying to communicate with me. Obviously it is constrained by only being able to communicate by rearranging whatever I put into the washing machine. And the fact that it is a 4-bit microcontroller attached only to a valve, a heater, and a motor (maybe a temperature sensor too, but I wouldn’t be surprised if there wasn’t one). It’s an intriguing possibility. Perhaps the form on the right of monsters represents a not quite complete utterance, the Duplo bricks not quite bonded properly.

The AI hypothesis raises several questions: How can we test it? Can we distinguish between merely aleatoric arrangements, and intentional ones? What is Bosch trying to say? Is there something I can put in the washing machine that would make it easier for Bosch to communicate? Is Bosch happy?

