On compiling 34 year old C code

2013-09-01

The 34 year old C code is the C code for ed and sed from Unix Version 7. I’ve been getting it to compile on a modern POSIXish system.

Some changes had to be made. But not very many.

The union hack

sed uses a union to save space in a struct. Specifically, the reptr union has two sub structs that differ only in that one of them has a char *re1 field where the other has a union reptr *lb1. In the old days it was possible to access members of structs inside unions without having to name the intermediate struct. For example the code in the sed implementation uses rep->ad1 instead of rep->reptr1.ad1. That’s no longer possible (I’m pretty sure this shortcut was already out of fashion by the time K&R was published in 1978, but I don’t have a copy to hand).

I first changed the union to a struct that had a union inside it only for the two members that differed:

struct	reptr {
	char	*ad1;
	char	*ad2;
	union {
		char	*re1;
		struct reptr	*lb1;
	} u;
	char	*rhs;
	FILE	*fcode;
	char	command;
	char	gfl;
	char	pfl;
	char	inar;
	char	negfl;
} ptrspace[PTRSIZE], *rep;

This meant changing a few “union reptr” to “struct reptr”, but most of the member accesses would be unchanged. ->re1 had to be changed to ->u.re1, but that’s a simple search and replace.

It wasn’t until I was explaining this ghastly use of union to Paul a day later that I realised the union is a completely unnecessary space-saving micro-optimisation. We can just have a plain struct in which only one of the two fields re1 and lb1 is ever used by any given entry. That’s much nicer, and so is the code.
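As a rough sketch of what the plain struct might look like (the field list is copied from the version above; the value of PTRSIZE and the helper set_branch() are placeholders invented for this illustration, not taken from the sed source):

```c
#include <assert.h>
#include <stdio.h>

#define PTRSIZE 200	/* placeholder value, an assumption for this sketch */

struct reptr {
	char	*ad1;
	char	*ad2;
	char	*re1;		/* any given entry uses re1 or lb1, not both */
	struct reptr	*lb1;
	char	*rhs;
	FILE	*fcode;
	char	command;
	char	gfl;
	char	pfl;
	char	inar;
	char	negfl;
} ptrspace[PTRSIZE], *rep;

/* Hypothetical helper: fill in lb1 and leave re1 explicitly unused. */
static void set_branch(struct reptr *r, struct reptr *target)
{
	assert(r != NULL);
	r->re1 = NULL;
	r->lb1 = target;
}
```

With this layout every access is a plain rep->re1 or rep->lb1 again, at the cost of one extra pointer per entry.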

The rise of headers

In K&R C if the function you were calling returned int then you didn’t need to declare it before calling it. Many functions that in modern times return void, used to return int (which is to say, they didn’t declare what they returned, so it defaulted to int, and if the function used plain return; then that was okay as long as the caller didn’t use the return value). exit() is such a function. sed calls it without declaring it first, and that generates a warning:

sed0.c:48:3: warning: incompatible implicit declaration of built-in function ‘exit’ [enabled by default]

I could declare exit() explicitly, but it seemed simpler to just add #include <stdlib.h>. And it is.

ed declares some of the library functions it uses explicitly. Like malloc():

char	*malloc();

That’s a problem because the declaration of malloc() has changed. Now it’s void *malloc(size_t). This is a fatal violation of the C standard that the compiler is not obliged to warn me about, but thankfully it does.

The modern fix is again to add header files. Amusingly, version 7 Unix didn’t even have a header file that declared malloc().
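A minimal sketch of the header-based style, assuming a hypothetical helper called save_string() (it is not a function from ed or sed): the declarations of malloc() and exit() come from <stdlib.h>, so no hand-written char *malloc(); is needed.

```c
#include <assert.h>
#include <stdlib.h>	/* declares malloc(), free() and exit() */
#include <string.h>	/* declares strlen() and memcpy() */

/* Hypothetical helper: copy a string into freshly allocated storage. */
char *save_string(const char *s)
{
	size_t n;
	char *p;

	assert(s != NULL);
	n = strlen(s) + 1;	/* include the terminating NUL */
	p = malloc(n);		/* void * converts to char * without a cast */
	if (p == NULL)
		exit(1);
	memcpy(p, s, n);
	return p;
}
```

The caller owns the returned storage and releases it with free().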

When it comes to mktemp() (which is also declared explicitly rather than via a header file), ed has a problem:

tfname = mktemp("/tmp/eXXXXX");

2 problems in fact. One is that modern mktemp() expects its argument to have 6 “X”s, not 5. But the more serious problem is that the storage pointed to by the argument will be modified, and the string literal that is passed is not guaranteed to be placed in a modifiable data region. I’m surprised this worked in Version 7 Unix. These days not only is it not guaranteed to work, it doesn’t actually work. SEGFAULT because mktemp() tries to write into a read-only page.

And the 3rd problem is of course that mktemp() is a problematic interface so the better mkstemp() interface made it into the POSIX standard and mktemp() did not.
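One possible shape for the replacement, as a sketch only: the template is copied into a writable buffer supplied by the caller, it has six “X”s, and mkstemp() both picks the name and opens the file. The function name open_temp() and its interface are assumptions made for this example, not what ed does.

```c
#define _POSIX_C_SOURCE 200809L	/* ask for the POSIX declaration of mkstemp() */
#include <assert.h>
#include <stdlib.h>	/* mkstemp() */
#include <string.h>	/* memcpy() */
#include <unistd.h>	/* close() and unlink(), for the caller's cleanup */

/*
 * Hypothetical helper: create a temporary file from a six-X template.
 * The generated name is written into the caller's buffer, which must
 * be writable. Returns an open file descriptor, or -1 on failure.
 */
int open_temp(char *name, size_t size)
{
	static const char template[] = "/tmp/eXXXXXX";

	assert(name != NULL);
	if (size < sizeof template)
		return -1;	/* buffer too small for the template */
	memcpy(name, template, sizeof template);
	return mkstemp(name);	/* replaces the X's in place */
}
```

A caller would declare something like char tfname[32];, pass it to open_temp(), and later close() the descriptor and unlink() the name. Unlike mktemp(), there is no window between choosing the name and creating the file.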

Which brings me to…

Unfashionable interfaces

ed uses gtty() and stty() to get and set the terminal flags (to switch off ECHO while the encryption key is read from the terminal). Not only is gtty() unfashionable in modern Unixes (I replaced it with tcgetattr()), it was unfashionable in Version 7 Unix. It’s not documented in the man pages; instead, tty(4) documents the use of the TIOCGETP command to ioctl().

ed is already using a legacy interface in 1979.
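For illustration, switching ECHO off with the termios interface could look roughly like this. The helper names echo_off() and echo_restore() are made up for the sketch, and the choice of TCSAFLUSH is an assumption rather than something taken from the modified ed.

```c
#include <assert.h>
#include <termios.h>	/* tcgetattr(), tcsetattr(), ECHO */
#include <unistd.h>

/*
 * Hypothetical helper: turn echo off on fd and remember the previous
 * settings in *saved. Returns 0 on success, -1 on failure (for
 * example when fd is not a terminal).
 */
int echo_off(int fd, struct termios *saved)
{
	struct termios t;

	assert(saved != NULL);
	if (tcgetattr(fd, saved) == -1)
		return -1;
	t = *saved;
	t.c_lflag &= ~(tcflag_t)ECHO;	/* clear only the ECHO bit */
	return tcsetattr(fd, TCSAFLUSH, &t);
}

/* Hypothetical helper: put back the settings saved by echo_off(). */
int echo_restore(int fd, const struct termios *saved)
{
	return tcsetattr(fd, TCSAFLUSH, saved);
}
```

The intended use is echo_off(0, &saved), then read the key, then echo_restore(0, &saved), which mirrors the gtty()/stty() pair in the original.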

9 Responses to “On compiling 34 year old C code”

  1. Tony Finch Says:

    What you call the union hack is very different from the modern idea of unnamed unions. In fact it is a misnomer, and is more to do with the behaviour of struct and union members.

    In pre standard C, there was a single namespace for all struct and union members. The compiler basically ignored the type of the pointer that was being dereferenced when compiling the -> operator, and just got the offset and referent type from the member name.

    This is why in old unix code, structure member names have type tags in their names, like s_ in struct stat.

    It also led to code that looks very weird from the modern point of view. The v6 kernel has a declaration

    struct { int integ; };

    which in modern C is useless, but at that time it allowed you to treat any value as an integer pointer. There were macros which expanded to the addresses of memory mapped hardware – bare integer constants – which could be accessed as MACRO->integ.

    • Taniwha Says:

      I think that this is a different issue – old C had no unions, instead structure names were a global namespace (they could be reused provided their offset was the same) and any pointer could be used – the kernel was full of structures that started with the same generic header followed by device specific stuff that was different, nowadays we’d use a union for that


  2. Anonymous unions/structs were added back in C11, and (maybe, because) they were never removed from C++.

    anaxagoras:~ mjb67local$ cat foo.c

    struct A {
    union {
    char b;
    int c;
    };
    };

    int main()
    {
    struct A a;
    a.c = 0;
    return a.c;
    }

    anaxagoras:~ mjb67local$ gcc -Wall foo.c

    anaxagoras:~ mjb67local$ gcc -Wall -pedantic foo.c
    foo.c:2:3: warning: anonymous unions are a C11 extension [-Wc11-extensions]
    union {
    ^
    1 warning generated.

    anaxagoras:~ mjb67local$ ln -sf foo.c foo.cpp

    anaxagoras:~ mjb67local$ g++ -Wall -pedantic foo.cpp

    anaxagoras:~ mjb67local$

  3. kbob Says:

    Ah, memories…

    String literals were mutable (in the .data segment) at least through the mid 1980s, and changing them at run time was not an uncommon practice.

    I also vaguely recall stty being a well-known library routine — are you sure there is not an stty(3) man page? If not, then the source for all of libc was just a few directories away…

    And $DIETY will consign you to eternal damnation for wasting tens, maybe hundreds of bytes of precious core in expanding that union! (-:

    • drj11 Says:

      The wastage is probably more like 32 bits per space. 64 bits in total. :)

    • unwesen Says:

      Yes. And I know at least one project that, instead of fixing the code, added command line flags to the compiler to keep strings mutable.

      Unfortunately, there were two functions in the code that referenced the same string global: one expected it to be mutable, the other expected it to always contain a particular string (think format string).

      So there was a conflict. Since it was heavily used legacy code, the mandate was to keep it running as-is. That meant no changes that management could not understand, and this particular interaction they could not understand.

      As a result, we had the conflicting instructions to a) not change either of the functions, and b) fix the spurious bugs that occurred in them.

      This should really be on the daily WTF.

  4. kimtoms Says:

    Regarding using the return; statement from an int function, I worked with a compiler for the PDP-11 (16 bit machine) that would return the last calculated value from a function. The PDP-11 had 8 registers, and R0 was the return value. We had to use it in one function for some reason which I can’t recall. It was a long time ago.

  5. neal Says:

    why did you need to compile this old code, or was this just for an educational experience?

