On compiling 34-year-old C code

2013-09-01

The 34-year-old C code in question is the C code for ed and sed from Unix Version 7. I’ve been getting it to compile on a modern POSIXish system.

Some changes had to be made. But not very many.

The union hack

sed uses a union to save space in a struct. Specifically, the reptr union has two sub structs that differ only in that one of them has a char *re1 field where the other has a union reptr *lb1. In the old days it was possible to access members of structs inside unions without having to name the intermediate struct. For example the code in the sed implementation uses rep->ad1 instead of rep->reptr1.ad1. That’s no longer possible (I’m pretty sure this shortcut was already out of fashion by the time K&R was published in 1978, but I don’t have a copy to hand).
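
For reference, the original union looks roughly like this (my reconstruction from the description above, abridged, so not a verbatim copy of the V7 source):

union	reptr {
	struct	reptr1 {
		char	*ad1;
		char	*ad2;
		char	*re1;
		/* ...the remaining fields, identical to reptr2... */
	} reptr1;
	struct	reptr2 {
		char	*ad1;
		char	*ad2;
		union reptr	*lb1;
		/* ...the remaining fields, identical to reptr1... */
	} reptr2;
} ptrspace[PTRSIZE], *rep;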

I first changed the union to a struct that had a union inside it only for the two members that differed:

struct	reptr {
	char	*ad1;
	char	*ad2;
	union {
		char	*re1;
		struct reptr	*lb1;
	} u;
	char	*rhs;
	FILE	*fcode;
	char	command;
	char	gfl;
	char	pfl;
	char	inar;
	char	negfl;
} ptrspace[PTRSIZE], *rep;

That meant changing a few “union reptr” to “struct reptr”, but most of the member accesses were unchanged. ->re1 had to be changed to ->u.re1, but that’s a simple search and replace.

It wasn’t until I was explaining this ghastly use of union to Paul a day later that I realised the union is a completely unnecessary space-saving micro-optimisation. We can just have a plain struct in which only one of the two fields re1 and lb1 is ever used. That’s much nicer, and so is the code.

The rise of headers

In K&R C, if the function you were calling returned int then you didn’t need to declare it before calling it. Many functions that in modern times return void used to return int (which is to say, they didn’t declare what they returned, so it defaulted to int, and if the function used a plain return; then that was okay as long as the caller didn’t use the return value). exit() is such a function. sed calls it without declaring it first, and that generates a warning:

sed0.c:48:3: warning: incompatible implicit declaration of built-in function ‘exit’ [enabled by default]

I could declare exit() explicitly, but it seemed simpler to just add #include <stdlib.h>. And it is.

ed declares some of the library functions it uses explicitly. Like malloc():

char	*malloc();

That’s a problem because the declaration of malloc() has changed. Now it’s void *malloc(size_t). This is a fatal violation of the C standard that the compiler is not obliged to warn me about, but thankfully it does.

The modern fix is again to add header files. Amusingly, Version 7 Unix didn’t even have a header file that declared malloc().

When it comes to mktemp() (which is also declared explicitly rather than via a header file), ed has a problem:

tfname = mktemp("/tmp/eXXXXX");

2 problems in fact. One is that modern mktemp() expects its argument to have 6 “X”s, not 5. But the more serious problem is that the storage pointed to by the argument will be modified, and the string literal that is passed is not guaranteed to be placed in a modifiable data region. I’m surprised this worked in Version 7 Unix. These days not only is it not guaranteed to work, it doesn’t actually work. SEGFAULT because mktemp() tries to write into a read-only page.

And the 3rd problem is of course that mktemp() is a problematic interface, so the better mkstemp() interface made it into the POSIX standard and mktemp() did not.
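
The shape of the modern fix, as a standalone sketch (this isn’t ed’s actual code; mkstemp() creates the file and hands back a file descriptor):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	/* an array, not a string literal, so the storage is writable;
	 * and 6 X's, as modern implementations require */
	char tfname[] = "/tmp/eXXXXXX";
	int fd = mkstemp(tfname);

	if (fd < 0) {
		perror("mkstemp");
		return 1;
	}
	printf("%s\n", tfname);	/* tfname now holds the generated name */
	close(fd);
	remove(tfname);
	return 0;
}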

Which brings me to…

Unfashionable interfaces

ed uses gtty() and stty() to get and set the terminal flags (to switch off ECHO while the encryption key is read from the terminal). Not only is gtty() unfashionable in modern Unixes (I replaced it with tcgetattr()), it was unfashionable in Version 7 Unix. It’s not documented in the man pages; instead, tty(4) documents the use of the TIOCGETP command to ioctl().

ed is already using a legacy interface in 1979.
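
For the record, the replacement is only a few lines of termios. A sketch of the shape (setecho is my name for it, not ed’s):

#include <termios.h>
#include <unistd.h>

/* Switch terminal echo off (on=0) or back on (on=1);
 * returns -1 on error. */
static int
setecho(int on)
{
	struct termios t;

	if (tcgetattr(STDIN_FILENO, &t) < 0)
		return -1;
	if (on)
		t.c_lflag |= ECHO;
	else
		t.c_lflag &= ~ECHO;
	return tcsetattr(STDIN_FILENO, TCSANOW, &t);
}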


CoffeeScript is 4 times shorter than C

2013-08-30

Compare FreeBSD’s sed.c (implemented in C) with drj’s sed.js (implemented in CoffeeScript).

A size comparison

              lines  bytes
C              2471    62k
CoffeeScript    533    14k

That right there is CoffeeScript’s chief advantage over C. The CoffeeScript is 4 times smaller, meaning CoffeeScript programs are likely to be faster to write, easier to maintain, and to contain fewer bugs.

Is it an apples to apples comparison? The FreeBSD sed implements a couple more options than my mostly POSIX-compliant sed.js, but they don’t pad it out much. Neither implementation includes code to handle Regular Expressions: sed.c uses the Unix regex library and sed.coffee uses JavaScript’s built-in RegExp class. So I think it’s a pretty reasonable comparison.

What makes the C code bigger? There is a lot of memory management, and an awful lot of string copying and substring extraction. There is an implementation of a hash table (for labels).

Obviously we all knew that high level languages resulted in smaller programs, but it’s rare to get the opportunity to do such a direct comparison.


sed, POSIX, and Node.js

2013-07-11

I’ve been implementing sed. A POSIX-compatible sed in Node.js.

It just seemed to me that one day soon the world will need a suite of Unix utilities written in Node.js. And I shall be The One.

The experience has made me a bit sad about the POSIX spec. There are problems. For example, it’s not very good at documenting the actual or desired behaviour of classic Unix utilities:

sed has a D command.

This command deletes the initial portion of the pattern space up to the first newline (which may be the entire pattern space if a newline has not been introduced with an editing command or with N); then D begins a new cycle. At the start of this new cycle, the next line of input is loaded into the pattern space, but ONLY IF THE PATTERN SPACE IS EMPTY.

This last bit is missing from the 2004 edition of the POSIX spec. It’s fixed and documented correctly in the 2013 edition of the POSIX spec.

The behaviour of sed hasn’t changed since Version 7 in 1979. The D command has always skipped appending input (if there was anything left in the pattern space). Probably no sed ever had its D command behave in the way documented in the 2004 POSIX spec. Maybe one would, if someone were to build a sed from scratch using the 2004 POSIX spec and without reference to any other sed implementation. But who would be mad enough to do that?
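
To see the skip-append behaviour concretely, here’s the classic two-line window idiom (my example, not from the spec):

$ printf '1\n2\n3\n4\n' | sed -n '$!N; P; D'
1
2
3
4

The window slides one line at a time precisely because D restarts the cycle without reading input while the pattern space is non-empty; if D read a new line unconditionally, as the 2004 wording implied, lines would be skipped.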

At some point someone drafting the POSIX spec didn’t notice the actual behaviour of sed, made a mistake in documenting the behaviour of its D command, and no one noticed until 2013 (well, a few years before, presumably). Which brings me to…

The pace of change is glacial.

Another thing about the POSIX spec which saddened me a little was the way all sorts of bizarre, obscure, and not very useful behaviours get documented and locked in. You know that sed has a ! modifier that negates an address. So «sed -n /barf/!p» prints all the lines that do NOT match /barf/. Did you know you can have as many ! as you like? «sed -n /barf/!!!!p» has the same behaviour as the previous program. At least according to the spec, and the Version 7 sed that I tried. There’s no point to this. No real program relies on this behaviour, yet there it is in the spec, so you have to implement it (if you want to comply with the spec). GNU sed (popular on Linux) gives an error instead.
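
You can try it yourself (my demo; single quotes keep the shell’s history expansion away from all those !s):

$ printf 'foo\nbarf\n' | sed -n '/barf/!p'
foo
$ printf 'foo\nbarf\n' | sed -n '/barf/!!!!p'
foo

(GNU sed rejects the second command outright.) Which brings me to…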

You can’t really rely on what you read in the spec being implemented.

or…

GNU feel free to depart from the spec whenever they see fit to do so.

sed is a bit weak. For example, its regular expressions (POSIX Basic Regular Expressions) don’t even support «|» for alternation. POSIX has Extended Regular Expressions. Wouldn’t it be sensible to move towards adding Extended Regular Expressions to all the tools that only have Basic Regular Expressions? Well, maybe yes, but there seems to be no appetite for doing that on the POSIX committee. And remember…

The pace of change is glacial.


Let me start in a way that i

2013-06-07

Let me start in a way that i hope you find controversial.

I am against gay marriage. The state should not recognise the marriages of gay couples.

But fear not, I’m a nice liberal Guardian-reading libdem-labour-green swing voter. I am not about to join the UKIP. To expand upon my position:

I am against marriage. The state should not recognise the marriages of couples.

Marriage is a contract between 2 people (and god, if you choose to believe). I don’t see why the state needs to get involved.

There are various rights and responsibilities that come with marriage, but in all cases i think it would be better if these were separated:

- married fathers have more rights (for example, direction of medical care) over their children than unmarried ones. That’s just silly.

- married couples have various tax advantages (rarely, disadvantages) depending on the whims of the current government. It really depends on what the policy is here. Is it to encourage people to form lasting bonds, or is it to recognise that people who stay at home to manage a household deserve a tax break to recognise their economic contributions and compensate for their lack of earnings? Two sisters share a home where one has a full-time job and the other looks after the house and does a few hours teaching a week (maybe they co-parent a child). Should they not be able to pool their tax-free allowance? People’s lives are entangled in all sorts of ways that the “married couple” paradigm does not recognise.

- married couples have access to the divorce courts. Unmarried couples would benefit from the guidance and impartiality of the divorce courts too. Separation is a stressful, messy, confusing thing to go through, and of course the divorce court does not wave a magic wand and make it all unicorns and ponies, but i think it is helpful, and it would be helpful to all couples.

- the estate is presumed to pass to the surviving spouse when one of a married couple dies.

It’s not that i don’t think the state should be doing these things; in most cases i think they should. It’s that i don’t think these things should be bundled up in Marriage. Sharing your tax-free allowance should be a matter of filing a form with the tax office. Letting your estate pass to another on your death should, again, just be a matter of filing a form.

Asserting paternal rights over a child is a lot more messy. Should it matter if you’re not the genetic father of a child, if you’re the one providing the home and investing in the child’s development? Should the mother be allowed to prevent access to a supposed father? I have no idea, but whether a couple is married or not should not matter.

Public recognition. Of course people should be allowed to marry and be married. Public recognition is one of the most important aspects of marriage. But this recognition comes from your peers, not from the state. Couples have been getting married for thousands of years, long before governments came along to record the marriages.

Marriage as a union for life between a man and a woman is a Christian thing. Marriage as actually practiced over the last 5000 years or so is much more diverse. We should embrace that diversity (once again). Capturing this in legislation would be a complicated nightmare.

The simplest thing the state can do is not get involved.


Date formats

2013-04-12

Hilary Mason on twitter bemoaned the fact that matplotlib appears to use a floating point number of days to represent datetimes and suggested that “any other standard format” would be better. It is a little bit odd that a Python library uses that format, but it’s presumably because matlab does, and matplotlib is betraying its matlab heritage.

What other standards are there? Well, there’s POSIX time_t which is either an integer or a floating-point type (but actual practice seems to favour integer), and stores the number of seconds since the Unix Epoch, 1970-01-01T00:00:00Z. Not counting leap seconds. Not counting leap seconds is convenient for some calculations, but means there are (recent) times that cannot be represented (namely, any moment during a leap second). That’s not a good representation of time.

There’s another time format in use which is confusingly similar to POSIX’s time_t: it counts time using the number of seconds since the Unix Epoch. Including leap seconds. The opportunities for mistakes in conversion are endless. A problem with this time format is that times in the future (more than a couple of years into the future) are ambiguous. Since we don’t yet know if there are going to be leap seconds between now and 2020 we don’t know whether 2020-01-01T00:00:00Z is 1577836800 or a few seconds more or less than that.

An example confusion: “man 2 time” on my Linux box claims that time_t is the number of seconds since the Epoch, but “date --date=2013-01-01 +%s” ends in “00” so clearly leap seconds aren’t being accounted for.
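
You can check the arithmetic (GNU date shown; 2013-01-01 is exactly 15706 days after the Epoch):

$ date --date='2013-01-01 00:00:00 UTC' +%s
1356998400

If time_t counted leap seconds the number would end in 25, not 00: 25 leap seconds had been inserted between 1972 and the end of 2012.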

That’s twice now that I’ve used ISO 8601: 2013-04-09T06:06:06Z is roughly when I’m writing this paragraph. As a format for representing times, this has obvious benefits and equally obvious drawbacks. The benefit is that it’s pretty clear, and almost human readable (by humans like me). A drawback is that it takes a lot of space: 20 characters as I’ve given it. If you were storing this in a file you could reasonably use the “compact” form: 20130409060606Z, which is 15 characters, or 14 without the ‘Z’. Space misers could encode this in BCD and fit it in 8 bytes, which is no more than a 64-bit time_t.

Another drawback with ISO 8601 is that calculations with times can be a bit trickier. But on the other hand, you’re more likely to get correct answers, and be able to represent all (recent and future) times.

I’ve been working with sea-level and tide data recently and Hilary’s floating point number of days is certainly pretty convenient for that. The calculations are concerned with the average rotation of the earth, and it nicely represents that. Practitioners of the art also seem to like a slightly different time scale which is Julian centuries. Basically the same but scaled by 36525. It’s not unusual for the 0-point to be something like 1900-01-01T00:00:00, or even a midday like 1899-12-31T12:00:00.

What about representing times with sub-second accuracy? When the Unix time_t is an integer type, you’re stuck. ISO 8601 times can be extended with a decimal point and have as many decimal places as you like. So that’s a good representation. Hilary’s floating point number of days is also pretty good, as long as you’re using double precision. If you’re using double precision then even counting Julian centuries is okay: you still get microsecond precision (2^-52 of a century is about 0.7 microseconds). JavaScript’s representation is basically a floating point number that is the Unix time_t but in milliseconds not seconds. That too is pretty good.

I think Hilary’s format is actually pretty reasonable. Some calculations (involving counting whole numbers of days between times) are easier, some are tricky (but no trickier than any other format really). All times are representable. There’s some ambiguity about times in the future (that 1200 appointment you make in the year 2020 might actually turn up in your calendar at 12:00:00.5 if there is a leap second, but at least your midnight appointments won’t be on the wrong day). Yes, it’s a little bit funky that on days with leap seconds the step between consecutive seconds will be different (it will be 1/86401 or 1/86399 of a day instead of 1/86400), but I’m sure you’ll cope.

After a while, you just get used to seeing a new file format, another new date representation. According to the xlrd documentation Excel uses a floating point number of days since 1899-12-30T00:00:00, but since they thought 1900 was a leap year, this is only reliable after 1900-03-01. Unless the Excel file was made on a Mac, in which case the Epoch is 1904-01-01. The TOPEX/Jason data for global mean sea level represents dates as a decimal fraction of years.

The bottom line is: There are all sorts of crazy date formats. Get over it.

I recommend:

1) use a good library;
2) for storage, use ISO 8601;
3) for runtime/manipulation, use a double.


All I need to know to learn R

2012-10-05

I’ve been learning R, mostly because it’s been on my list of things to do for ages, and partly because I needed to draw a histogram.

All the tedious stuff about how you get started and how you install things is surprisingly difficult to get from the internets.

So to install on Ubuntu:

sudo apt-get install r-base

Then you run the command R:

$ R

Once you’re in R you can find things out by using Google. Once you’ve found a function you want to use, say sj.test, and it doesn’t seem to be installed, you can install it by noting the library name, which is in curly brackets at the top of the man page: {lawstat} in this case. So you then go:

> install.packages("lawstat") # to install it
> library(lawstat) # to use it

(the package installer has a hilariously craptastic interface written in Tcl/Tk)

That’s it. Everything I need to know to learn R. Everything else is just bog standard programming language stuff (though it helps a bit that I learnt a bit of J).

Here’s the histogram:

And the R code is behind this link. The way the function png() implicitly makes hist() write into the PNG file is particularly bletcherous. It has all the elegance of writing JCL for IBM mainframes.
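
For the curious, the png()/hist() dance looks roughly like this (a sketch, not the actual code behind the link):

> png("histogram.png")  # open a PNG graphics "device"; plotting output now goes there
> hist(rnorm(1000))     # draws the histogram straight into histogram.png
> dev.off()             # close the device, flushing the file to disk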


Bosch: the Constructor

2012-05-17

We got some second-hand Duplo recently, and washed it. In Bosch, our washing machine. In a pillow case at 30°C on the delicate program. Makes a funny noise but it all came out fine. This was the first time I’d washed Lego in a machine, so it’s nice to know that the suggestion in the Lego FAQ works.

The curious thing was that the Duplo had SELF ASSEMBLED. It went in all in separated bricks. But when it came out, some of it had stuck together. I found this pretty amazing. Surely if Lego can self assemble in the washing machine then it is a simple step from there to the beginning of life itself.

The variety of forms is interesting. Simple diatoms… in mixed colours:

…and also monochrome:

Simple towers, reminiscent of the things my little nephew makes, when he can be bothered:

Then come the forms that are not so easy to characterise. I like to call them monsters:

I’m particularly impressed with Bosch’s creative instincts here. The one at the front is a battleship with effective use of colour. The back left reminds me a bit of a tree. Perhaps some sort of highly coloured baobab tree. And is that a red boot kicking a yellow football on a field of green?

Francis speculates (whimsically) that maybe my washing machine has passed the singularity and is in fact an Artificial Intelligence trying to communicate with me. Obviously it is constrained by only being able to communicate by rearranging whatever I put into the washing machine. And by the fact that it is a 4-bit microcontroller attached only to a valve, a heater, and a motor (maybe a temperature sensor too, but I wouldn’t be surprised if there wasn’t one). It’s an intriguing possibility. Perhaps the form on the right of monsters represents a not quite complete utterance, the Duplo bricks not quite bonded properly.

The AI hypothesis raises several questions: How can we test it? Can we distinguish between merely aleatoric arrangements, and intentional ones? What is Bosch trying to say? Is there something I can put in the washing machine that would make it easier for Bosch to communicate? Is Bosch happy?


In praise of ⌘

2012-05-04

Control-C is the intr character in Unix.  It sends SIGINT to the currently running process, thereby interrupting it.  Most programs either quit or return to an interactive prompt when they receive this signal.

Control-C is the Copy command in Windows.  It copies the highlighted text (or other graphic objects) to the clipboard.

What do you do when you’re using a terminal emulation program on Windows to connect to a Unix computer?  Does pressing Control-C send a Control-C character to the Unix system, or does it copy the highlighted text to the clipboard?  I have no idea (it’s been a long while since I used a terminal emulator on Windows), but one thing is clear: It’s not clear what the right answer is.

On a Mac, Copy is an example of a command to a GUI application. Keyboard shortcuts are accessed using the Command key.

[The Command key symbol; image from Wikimedia.]

This is a stroke of genius.

Now, when using a terminal emulation program on a Mac, it’s completely obvious what happens. Control-C sends a Control-C character to the Unix system, which will interpret it as intr. Command-C copies the highlighted text to the clipboard.

Was this luck? Hard to tell, but I think not. The Mac was conceived as an entire product. It therefore seems completely reasonable to design a keyboard just for that product, and completely reasonable to create a new key for doing Mac things. Windows was conceived as an operating system. It had to run on existing hardware.

Why didn’t Apple just stick a Control key on the keyboard and use that? Well, that would have been silly, and difficult to explain. Control had an existing use (on a terminal to send control characters that generally affected either the connection or the movement of the print head) which had nothing to do with GUI commands. Better to name the key that is used for keyboard shortcuts for GUI commands, Command.

Why didn’t Windows do this? Microsoft weren’t in a position where they could dictate what keys appeared on the keyboard. They had to make do with whatever was there. Why Control-C for Copy instead of Alt-C, say? I have no idea. It seems particularly mysterious since, at the time Windows was layered on top of DOS and so was used by people familiar with DOS and in DOS, at least at the command prompt, Control-C meant interrupt, like it does in Unix.

The lucky bit seems to be that Apple ever put a Control key on the keyboard at all; early keyboards didn’t have one, but I guess as soon as you have a modem you can connect to some other computer and that will require a Control key for some things. But basically the Control key on a Mac hung around for 2 decades doing nothing much until OS X came out, and then it was used in all the traditional Unix ways.

What did Unix do when it got a window system? Well, on X Windows in the mid nineties I don’t remember using a keyboard shortcut for Copy. In fact, I don’t remember using Copy at all. The middle mouse button (most Unix workstations and X servers of the time had mice with 3 or more buttons) would paste the highlighted text into whichever window the mouse was pointing at (note: the highlighted text could be in a different window from the one which receives the pasted text). No separate Copy command, just highlight text in one window, and click middle button to paste it into another window. Cute.

Windows has a Windows button now (and, much to my surprise has had it since Windows 95). Obviously there was no way that Copy could be moved from Control-C to Windows-C because Control-C was already welded into the minds of all Windows users. So the Windows button generally does useless things in Windows.

What about Ubuntu? They could’ve decided that it was reasonable to assume that all keyboards would have the Windows key. They could’ve decided that all GUI shortcut commands would use the Windows key. Freeing up Control-C to mean intr. You would’ve thought that for a GUI running on Unix this would be an important consideration. But no, Ubuntu thoughtlessly copies Windows and Control-C means Copy. Unless you’re in the terminal emulator, Terminal, in which case it sends a Control-C character and you have to remember to use Shift-Control-C for Copy. And just like Windows, the Windows key is useless in Ubuntu.

It all just seems like a lost opportunity to me.


Making change with shell

2012-03-13

I was flicking through Wikström’s «Functional Programming Using Standard ML», when I noticed he describes the problem of making up change for an amount of money m using coins of certain denominations (18.1.2, page 233). He says we “want a function change that given an amount finds the smallest number of coins that adds to that amount”, and “Obviously, you first select the biggest possible coin”. Here’s his solution in ML:

exception change;
fun change 0 cs = nil
  | change m nil = raise change
  | change m (ccs as c::cs) = if m >= c
      then c::change (m-c) ccs
      else change m cs;

It’s pretty neat. The recursion proceeds by either reducing the magnitude of the first argument (the amount we are giving change for), or reducing the size of the list that is the second argument (the denominations of the coins we can use); so we can tell that the recursion must terminate. Yay.

It’s not right though. Well, it gives correct change, but it doesn’t necessarily find the solution with the fewest number of coins. Actually, it depends on the denominations of coins in our currency; probably for real currencies the “biggest coin first” algorithm does in fact give the fewest number of coins, but consider the currency used on the island of san side-effect, the lambda. Lambdas come in coins of Λ1, Λ10, and Λ25 (that’s not a wedge, it’s a capital lambda. It’s definitely not a fake A).

How do we give change of Λ30? { Λ10, Λ10, Λ10 } (3 tens); what does Wikström’s algorithm give? 1 twenty-five and 5 ones. Oops.

I didn’t work out a witty solution to the fewest number of coins change, but I did create, in shell, a function that lists all the possible ways of making change. Apart from trivial syntactic changes, it’s not so different from the ML:

_change () {
    # $1 is the amount to be changed;
    # $2 is the largest coin;
    # $3 is a comma separated list of the remaining coins

    if [ "$1" -eq 0 ] ; then
        echo ; return
    fi
    if [ -z "$2" ] ; then
        return
    fi
    _change $1 ${3%%,*} ${3#*,}
    if [ "$1" -lt "$2" ] ; then
        return
    fi
    _change $(($1-$2)) $2 $3 |
        while read a ; do
            echo $2 $a
        done
}

change () {
    _change $1 ${2%%,*} ${2#*,},
}

Each solution is output as a single line with the coins used in a space separated list. change is a wrapper around _change which does the actual work. The two base cases are basically identical: «"$1" -eq 0» is when we have zero change to give, and we output an empty line (just a bare echo) which is our representation for the empty list; «-z "$2"» is when the second argument (the first element of the list of coins) is empty, and, instead of raising an exception, we simply return without outputting any list at all.

The algorithm to generate all possible combinations of change is only very slightly different from Wikström’s: if we can use the largest coin, then we generate change both without using the largest coin (first recursive call to _change, on line 12) and using the largest coin (second recursive call to _change, on line 16). See how we use a while loop to prepend (cons, if you will) the largest coin value to each list returned by the second recursive call. Of course, when the largest coin is too large then we proceed without it, and we only have the first recursive call.

The list of coins is managed as two function arguments. $2 is the largest coin, $3 is a comma separated list of the remaining coins (including a trailing comma, added by the wrapping change function). See how, in the first recursive call to _change $3 is decomposed into a head and tail with ${3%%,*} and ${3#*,}. As hinted at in the previous article, «%%» is a greedy match and removes the largest suffix that matches the pattern «,*» which is everything from the first comma to the end of the string, and so it leaves the first number in the comma separated list. «#» is a non-greedy match and removes the smallest prefix that matches the pattern «*,», so it removes the first number and its comma from the list. Note how I am assuming that all the arguments do not contain spaces, so I am being very cavalier with double quotes around my $1 and $2 and so on.
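
To see the head/tail decomposition in isolation (my example, poking at the positional parameters directly):

$ set -- 30 25 10,1,
$ echo ${3%%,*}
10
$ echo ${3#*,}
1,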

It even works:

$ change 30 25,10,1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10 10 1 1 1 1 1 1 1 1 1 1
10 10 10
25 1 1 1 1 1
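
And although I didn’t work out that witty fewest-coins solution, the enumeration makes a brute-force one cheap (my addition: awk picks the line with the fewest fields):

$ change 30 25,10,1 | awk 'NF < best || !best { best = NF; line = $0 } END { print line }'
10 10 10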

Taking the bash out of Mark

2012-03-05

Mark Dominus, in his pretty amusing article about exact rational arithmetic in shell, gives us this little (and commented!) shell function:

        # given an input number which might be a decimal, convert it to
        # a rational number; set n and d to its numerator and
        # denominator.  For example, 3.3 becomes n=33 and d=10;
        # 17 becomes n=17 and d=1.
        to_rational() {
          # Crapulent bash can't handle decimal numbers, so we will convert
          # the input number to a rational
          if [[ $1 =~ (.*)\.(.*) ]] ; then
              i_part=${BASH_REMATCH[1]}
              f_part=${BASH_REMATCH[2]}
              n="$i_part$f_part";
              d=$(( 10 ** ${#f_part} ))
          else
              n=$1
              d=1
          fi
        }

Since I’m on a Korn overdrive, what would this look like without the bashisms? Dominus uses BASH_REMATCH to split a decimal fraction at the decimal point, thus splitting ‘iii.fff’ into ‘iii’ and ‘fff’. That can be done using portable shell syntax (that is, blessed by the Single Unix Specification) using the ‘%’ and ‘#’ features of parameter expansion. Example:

$ f=3.142
$ echo ${f%.*}
3
$ echo ${f#*.}
142

In shell, «${f}» is the value of the variable (parameter) f; you probably knew that. «${f%pattern}» removes any final part of f that matches pattern (which is a shell pattern, not a regular expression). «${f#pattern}» removes any initial part of f that matches pattern (full technical details: they remove the shortest match; use %% and ## for greedy versions).

Thus, between them «${f%.*}» and «${f#*.}» are the integer part and fractional part (respectively) of the decimal fraction. The only problem is when the number has no decimal point. Well, Dominus special cased that too. Of course the “=~” operator is a bashism (did perl inspire bash, or the other way around?), so portable shell programmers have to use ‘case’ (which traditionally was always preferred even when ‘[‘ could be used because ‘case’ didn’t fork another process). At least this version features a secret owl hidden away (on line 3):

to_rational () {
  case $1 in
    (*.*) i_part=${1%.*} f_part=${1#*.}
      n="$i_part$f_part"
      d=$(( 10 ** ${#f_part} )) ;;
    (*) n=$1 d=1 ;;
  esac
}

The ‘**’ in the arithmetic expression raised a doubt in my mind and, *sigh*, it turns out that it’s not portable either (it does work in ‘ksh’, but it’s not in the Single Unix Specification). Purists have to use a while loop to add a ‘0’ digit for every digit removed from f_part:

to_rational () {
  case $1 in
    (*.*) i_part=${1%.*} f_part=${1#*.}
      n="$i_part$f_part"
      d=1;
      while [ -n "${f_part}" ] ; do
          d=${d}0
          f_part=${f_part%?}
      done ;;
    (*) n=$1 d=1 ;;
  esac
}
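
A quick check that the portable version behaves like the bash original:

$ to_rational 3.142; echo $n/$d
3142/1000
$ to_rational 17; echo $n/$d
17/1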

Traditional shell didn’t support this «${f%.*}» stuff, but it’s been in the Single Unix Specification for ages. It’s been difficult to find a Unix with a shell that didn’t support this syntax since about the year 2000. It’s time to start being okay with using it.

