Resilient Haskell Software

Lessons learned about bitrot in Haskell software
Haskell, R, archiving, technology, computer-science
2008-09-26–2011-02-12 · in progress · certainty: likely · importance: 4

In 2007, Haskell began to experience some real growth; one of the new users was myself. The old ways of individual distribution with configuration & installation weren’t state of the distribution art; the shining example was Perl’s CPAN. At about that time, Duncan Coutts and a few others proposed something similar called Cabal.1

I was looking around for a way to help out the Haskell community, and was impressed by the idea of Cabal. It would make it easier for Linux distributions to package Haskell libraries & executables (and has: witness how distro packages are automatically generated from the Cabal packages). But more importantly, by making dependencies easier for authors to rely on (or eventually automatically installable thanks to cabal-install), Cabal would discourage duplication and encourage splitting out libraries. An application like the Gitit wiki, with its >40 dependencies, would simply be unthinkable without it. It may be hard for Haskellers who started after 2009 to believe, but applications would often keep in the source repository copies of whatever libraries they needed, if the libraries weren’t simply copied directly into the application’s source code. (In a case I remember because I did much of the work to fix it, Darcs used a primitive local version of bytestring for years after bytestring’s release.)

Unfortunately, Cabal’s uptake struck me as slow. My belief seems to be supported by the upload log. In 2006, there were 46 uploads by 5 people; the uploads were mostly of the ‘boot’ libraries distributed with GHC like mtl or network or unix. 2007 shows a much better uptake, with 586 uploads (not 586 packages, since many packages had multiple versions uploaded) by 100 people. (I was responsible for 4 uploads.)

So I decided to spend most of my time Cabalizing packages or nagging maintainers to upload. Ultimately I did 150 uploads in 2008. (A discussion of my Haskell and uploading activities can be found on the about me page.) In total, Hackage saw 2307 uploads by 250 people that year. In 2009, there were 3671 uploads by 391 people; in 2010, there were 5174 uploads by 490 people. Long story short, Cabal has decisively defeated Autotools in the Haskell universe, and is the standard; the only holdouts are legacy projects too complex to Cabalize (like GHC) or refuseniks (like David Roundy and John Meacham). I flatter myself that my work may have sped up Cabal adoption & development.

As you can imagine, I ran into many Cabal limitations and bugs as I bravely went where no Cabalist went before, but most of them have since been fixed. I also worked on some very old code (one game dated back to 1997 or so) and learned more than I wanted to about how old Haskell code could bitrot.

I learned (I hope) some good practices that would help reduce bitrot. In order of importance, they were:

  • Cabalization and metadata are good. This ties into the old declarative vs imperative approach: a Makefile can be doing all sorts of bizarre IO and peculiar scripting. It’s one thing to understand a README which mentions that such and such a file needs to have a field edited, and that the LaTeX and man pages should be generated from the LaTeX documentation; but it’s quite another to understand a Makefile which uses baroque shell one-liners to do the same thing. The former has a hope of being easily converted to alternative packaging and make systems, and the latter doesn’t.
  • Unless there’s a very good reason, not using Darcs, Cabal, and GHC is only hurting yourself. Those three are currently “too big to fail”.
  • Fix -Wall warnings as often as possible. What is today merely imperfect style can later become real, intractable errors.
  • Anything which has a custom Setup.hs or which touches the GHC API is death to maintain! I cannot emphasize this enough: these bits of functionality bitrot like mad. Graphics libraries are quite bad, but the GHC and Cabal internals are even more unstable. This is not necessarily a bad thing; the Linux kernel developers have a similar famous philosophy, articulated as why you don’t want a stable binary API. But it is definitely something to bear in mind.
  • It may seem anal to explicitly enumerate imports (i.e. import Data.List (nub)), particularly given how this can restrain flexibility and cause annoying compile problems, but much later, enumerated imports are incredibly valuable. Ten years from now, the Haskeller looking at your code may have no idea what this Linspire.Debian module is. You may say, just comment out imports one by one and see what goes missing. But what if there are a dozen other things broken, or dozens of imported modules? The cascade of complexities can defeat simplistic techniques like that. And you really have no choice: imports are one of the very first things which get checked by the compiler. If they don’t work out, you’ll be stopped dead right there. There are other benefits of course: you significantly reduce the chance of ambiguous imports, and dead code becomes much easier to find. (This can be taken too far, of course; it usually makes little sense to explicitly import and enumerate stuff from the Prelude.)
  • Another stylistic point is that functions defined in where-clauses can easily accidentally use more variables than they are given as arguments. This can lead to nice-looking code indeed, but it can make debugging difficult later: the type signatures are usually omitted in where-clauses. Suppose you need them? You will have difficulty hoisting the local definition out to the top level, where you can actually see what type is being inferred and how it conflicts with what type is needed.
  • Code can hang around for a very long time. It is short-sighted to not preserve code somewhere else. I ran into some particularly egregious examples where not only had the site gone down, taking with it the code, but their robots.txt had specifically disallowed the Internet Archive from backing up their site! I personally regard such actions as spitting in the face of the community, since we all stand on each other’s toes, as the expression goes. There are no truly independent works.
  • In a similar vein, we should remember openness is a good thing! Openness is such an important thing that entire operating systems are created just for it. Why did OpenBSD fork from NetBSD and take that name? Was it just because of bad blood in the core development team? No, everyone else went along because OpenBSD took a stand and made its CVS repositories open. Open repositories encourage contribution as little else short of a Free license does; if you keep the repository private, people will always worry about duplicated and wasted work, about rejected patches, about missing files necessary for compilation and development but not for release and simple usage. Open repos invite people to contribute, they allow your code to be automatically archived by search engines, the Internet Archive, and your fellow coders.
  • Licensing information is a must! A custom license is as bad as a custom Setup.hs, in a way. It is hard to add into files, which increases uncertainty and legal risk for everyone interested in preserving the code. Which are you more likely to work on and distribute: a file which says in the header “License: GPL”, nothing at all, or even worse, “see LICENSE for the crazy license I invented while on a drunken fender bender”?
  • Besides avoiding writing non-Free software, do not depend on non-Free software. “In the long run, the utility of all non-Free software approaches zero. All non-Free software is a dead end.”2 Non-free software inherently limits the pool of people allowed to modify it, hastening the day it finally dies. A program cannot afford to be picky, or to introduce any friction whatsoever. In a real sense, bad or unclear licensing is accidental complexity. There’s enough of that in the world as it is.
  • A program which confines itself to Haskell’98 and nothing outside the base libraries can last a long time; just the other day, I was salvaging the QuakeHaskell code from 1996/1997. Once the file names were matched to the module names, most of the code compiled fine. Similarly, I took Haskell in Space from 2001, and all I had to do was update where the HGL functions were being imported from. A corollary to this is that code using just the Prelude is effectively immortal.
  • Include derivations! It’s perfectly fine to use clever techniques and definitions, such as rleDecode = (uncurry replicate =<<) for decoding run-length encoded lists of tuples3, but in the comments, include the original giant definition which you progressively refined into a short diamond! Even better, add a test (like a QuickCheck property) where you demonstrate that the output from the two is the same. If you are optimizing, somewhere hold onto the slow versions which you know are correct. Derivations are brilliant documentation of your intent, they provide numerous alternate implementations which might work if the current one breaks, and they give the future Haskellers a view of how you were thinking.
  • Avoid binding to C++ software if possible. I once tried to cabalize Qthaskell, which binds to the Qt GUI library. Apparently, you can only link to a C++ library after generating a C interface, and the procedure for this is non-portable, unreliable, and definitely not well-supported by Cabal.
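
As a rough sketch of the “include derivations” advice (not the original derivation, which is not reproduced on this page), the block below keeps a long-hand decoder next to the point-free rleDecode and checks that they agree. Only rleDecode comes from the text; rleDecodeSlow, rleDecodeConcat, rleEncode, and the two prop_ functions are hypothetical names introduced for illustration. It also uses an explicit import list and top-level type signatures, in the spirit of the earlier points.

```haskell
-- Sketch only: every name except rleDecode is hypothetical.
import Data.List (group) -- explicit import list: only 'group' is used

-- Step 0: the long-hand definition, kept as the trusted reference.
rleDecodeSlow :: [(Int, a)] -> [a]
rleDecodeSlow []              = []
rleDecodeSlow ((n, x) : rest) = replicate n x ++ rleDecodeSlow rest

-- Step 1: the explicit recursion is just concatMap over the pairs.
rleDecodeConcat :: [(Int, a)] -> [a]
rleDecodeConcat = concatMap (\(n, x) -> replicate n x)

-- Step 2: the lambda is 'uncurry replicate', and in the list monad
-- (f =<<) behaves like concatMap f, which gives the short form.
rleDecode :: [(Int, a)] -> [a]
rleDecode = (uncurry replicate =<<)

-- Encoder used only to build inputs for the round-trip check.
rleEncode :: Eq a => [a] -> [(Int, a)]
rleEncode xs = [(length run, x) | run@(x : _) <- group xs]

-- Property: all three decoders agree on the same input.
prop_decodersAgree :: [(Int, Char)] -> Bool
prop_decodersAgree pairs =
  rleDecode pairs == rleDecodeSlow pairs
    && rleDecodeConcat pairs == rleDecodeSlow pairs

-- Property: decoding undoes encoding.
prop_roundTrip :: String -> Bool
prop_roundTrip s = rleDecode (rleEncode s) == s
```

The properties are plain functions so the block needs only base; with QuickCheck installed, passing them to quickCheck would exercise them on random inputs. If the short form ever stops compiling, the slow version is still there as a fallback and as a record of intent.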

  1. Strictly speaking, Cabal was first proposed in 2003 and a paper published in 2005, but 2007 was when there was running code that could credibly be the sole method of installation, and not an experimental alternative which one might use in parallel with Autotools.↩︎

  2. Mark Pilgrim, “Freedom 0”; he is quoting/paraphrasing Matthew Thomas↩︎

  3. For an explanation of this task, and how that monad stuff works, see ↩︎