Resilient Haskell Software

Lessons learned about bitrot in Haskell software
Haskell, R, archiving, technology, computer-science
2008-09-26–2011-02-12; in progress; certainty: likely; importance: 4

In 2007, Haskell began to experience some real growth; one of the new users was myself. The old ways of individual distribution with configuration & installation weren’t the state of the distribution art; the shining example was Perl’s CPAN. At about that time, Duncan Coutts and a few others proposed something similar called Cabal.¹

I was looking around for a way to help out the Haskell community, and was impressed by the idea of Cabal. It would make it easier for Linux distributions to package Haskell libraries & executables (and it has: witness how distro packages are automatically generated from the Cabal packages). But more importantly, by making dependencies easier for authors to rely on (or eventually automatically installable thanks to cabal-install), Cabal would discourage duplication and encourage splitting out libraries. Without it, an application like the Gitit wiki, with its >40 dependencies, would simply be unthinkable. It may be hard for Haskellers who started after 2009 to believe, but applications would often keep copies of whatever libraries they needed in the source repository, if the libraries weren’t simply copied directly into the application’s source code. (In a case I remember because I did much of the work to fix it, Darcs used a primitive local version of bytestring for years after bytestring’s release.)

Unfortunately, Cabal’s uptake struck me as slow. My belief seems to be supported by the upload log. In 2006, there were 46 uploads by 5 people; the uploads were mostly of the ‘boot’ libraries distributed with GHC like mtl or network or unix. 2007 shows a much better uptake, with 586 uploads (not 586 packages, since many packages had multiple versions uploaded) by 100 people. (I was responsible for 4 uploads.)

So I decided to spend most of my time Cabalizing packages or nagging maintainers to upload. Ultimately I did 150 uploads in 2008. (A discussion of my Haskell and uploading activities can be found on the about me page.) In total, Hackage saw 2307 uploads by 250 people that year. In 2009, there were 3671 uploads by 391 people; in 2010, there were 5174 uploads by 490 people. Long story short, Cabal has decisively defeated Autotools in the Haskell universe, and is the standard; the only holdouts are legacy projects too complex to Cabalize (like GHC) or refuseniks (like David Roundy and John Meacham). I flatter myself that my work may have sped up Cabal adoption & development.

As you can imagine, I ran into many Cabal limitations and bugs as I bravely went where no Cabalist had gone before, but most of them have since been fixed. I also worked on some very old code (one game dated back to 1997 or so) and learned more than I wanted to about how old Haskell code could bitrot.

I learned (I hope) some good practices that would help reduce bitrot. In order of importance, they were:

  • Cabalization and metadata are good. This ties into the old declarative vs imperative distinction: a Makefile can be doing all sorts of bizarre IO and peculiar scripting. It’s one thing to understand a README which mentions that such and such a file needs to have a field edited, and that the LaTeX and man pages should be generated from the LaTeX documentation; but it’s quite another to understand a Makefile which uses baroque shell one-liners to do the same thing. The former has a hope of being easily converted to alternative packaging and make systems, and the latter doesn’t.
  • Unless there’s a very good reason, not using Darcs, Cabal, and GHC is only hurting yourself. Those three are currently “too big to fail”.
  • Fix -Wall warnings as often as possible. What is today merely imperfect style can later become real, intractable errors.
  • Anything which has a custom Setup.hs or which touches the GHC API is death to maintain! I cannot emphasize this enough: these bits of functionality bitrot like mad. Graphics libraries are quite bad, but the GHC and Cabal internals are even more unstable. This is not necessarily a bad thing; the Linux kernel developers have a similar famous philosophy, articulated as why you don’t want a stable binary API. But it is definitely something to bear in mind.
  • It may seem anal to explicitly enumerate imports (i.e. import Data.List (nub)), particularly given how this can restrain flexibility and cause annoying compile problems, but much later, enumerated imports are incredibly valuable. Ten years from now, the Haskeller looking at the code may have no idea what this Linspire.Debian module is. You may say, just comment out imports one by one and see what goes missing. But what if there are a dozen other things broken, or dozens of imported modules? The cascade of complexities can defeat simplistic techniques like that. And you really have no choice: imports are one of the very first things which get checked by the compiler. If they don’t work out, you’ll be stopped dead right there. There are other benefits of course: you significantly reduce the chance of ambiguous imports, and dead code becomes much easier to find. (This can be taken too far, of course; it usually makes little sense to explicitly import and enumerate stuff from the Prelude.)
  • Another stylistic point is that functions defined in where-clauses can easily and accidentally use more variables than they are given as arguments. This can lead to nice-looking code indeed, but it can make debugging difficult later, because the type signatures are usually omitted in where-clauses. Suppose you need them? You will have difficulty hoisting the local definition out to the top level, where you can actually see what type is being inferred and how it conflicts with what type is needed.
  • Code can hang around for a very long time. It is short-sighted to not preserve code somewhere else. I ran into some particularly egregious examples where not only had the site gone down, taking with it the code, but their robots.txt had specifically disallowed the Internet Archive from backing up their site! I personally regard such actions as spitting in the face of the community, since we all stand on each other’s toes, as the expression goes. There are no truly independent works.
  • In a similar vein, we should remember openness is a good thing! Openness is such an important thing that entire operating systems are created just for it. Why did OpenBSD fork from NetBSD and take that name? Was it just because of bad blood in the core development team? No, everyone else went along because OpenBSD took a stand and made its CVS repositories open. Open repositories encourage contribution as little else short of a Free license does; if you keep the repository private, people will always worry about duplicated and wasted work, about rejected patches, about missing files necessary for compilation and development but not for release and simple usage. Open repos invite people to contribute, and they allow your code to be automatically archived by search engines, the Internet Archive, and your fellow coders.
  • Licensing information is a must! A custom license is as bad as a custom Setup.hs, in a way. It is hard to add into files, which increases uncertainty and legal risk for everyone interested in preserving it. Which are you more likely to work on and distribute: a file which says in the header “License: GPL”, nothing at all, or even worse, “see LICENSE for the crazy license I invented while on a drunken fender bender”?
  • Besides avoiding writing non-Free software, do not depend on non-Free software. “In the long run, the utility of all non-Free software approaches zero. All non-Free software is a dead end.”² Non-free software inherently limits the pool of people allowed to modify it, hastening the day it finally dies. A program cannot afford to be picky, or to introduce any friction whatsoever. In a real sense, bad or unclear licensing is accidental complexity. There’s enough of that in the world as it is.
  • A program which confines itself to Haskell’98 and nothing outside the base libraries can last a long time; just the other day, I was salvaging the QuakeHaskell code from 1996/1997. Once the file names were matched to the module names, most of the code compiled fine. Similarly, I took Haskell in Space from 2001, and all I had to do was update where the HGL functions were being imported from. A corollary to this is that code using just the Prelude is effectively immortal.
  • Include derivations! It’s perfectly fine to use clever techniques and definitions, such as rleDecode = (uncurry replicate =<<) for decoding run-length encoded lists of tuples³, but in the comments, include the original giant definition which you progressively refined into a short diamond! Even better, add a test (like a QuickCheck property) where you demonstrate that the output from the two are the same. If you are optimizing, somewhere hold onto the slow ones which you know are correct. Derivations are brilliant documentation of your intent, they provide numerous alternate implementations which might work if the current one breaks, and they give the future Haskellers a view of how you were thinking.
  • Avoid binding to C++ software if possible. I once tried to cabalize Qthaskell, which binds to the Qt GUI library. Apparently, you can only link to a C++ library after generating a C interface, and the procedure for this is non-portable, unreliable, and definitely not well-supported by Cabal.
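
The point above about enumerated imports can be sketched as follows. This is only an illustration using modules from base; the function names (uniqueSorted, shout, lookupOr) are hypothetical and not taken from any of the packages discussed.

```haskell
-- Enumerated imports: each imported name records which module it came from,
-- so a future reader can tell at a glance what a missing module provided.
import Data.Char (toUpper)
import Data.List (nub, sort)
-- A qualified import is the other way to keep names traceable:
-- every use is spelled Maybe.something.
import qualified Data.Maybe as Maybe

-- | Remove duplicates, then sort. Both helpers are visibly from Data.List.
uniqueSorted :: Ord a => [a] -> [a]
uniqueSorted = sort . nub

-- | Upper-case a string; 'toUpper' is visibly from Data.Char.
shout :: String -> String
shout = map toUpper

-- | Look up a key in an association list, falling back to a default.
-- 'lookup' comes from the Prelude, which is not worth enumerating.
lookupOr :: Eq k => v -> k -> [(k, v)] -> v
lookupOr def key table = Maybe.fromMaybe def (lookup key table)
```

If Data.List ever moved or renamed nub, the compiler error would point at one import line, and the import list itself says which two names have to be found elsewhere.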
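
The where-clause point above is easier to see side by side. The sketch below is a made-up example (normalizeLocal, normalizeTop and scaleBy are hypothetical names): the first version lets the local helper silently capture variables from the enclosing function, and the second hoists it to the top level with every input as an explicit argument.

```haskell
-- Version 1: the helper 'scaled' has no type signature and silently uses
-- 'factor' and 'offset' from the enclosing function, although it is only
-- "given" the argument x.
normalizeLocal :: Double -> Double -> [Double] -> [Double]
normalizeLocal factor offset xs = map scaled xs
  where
    scaled x = (x - offset) / factor

-- Version 2: the same helper hoisted to the top level. Everything it
-- depends on is now an argument, and its type is written down where the
-- compiler can check it and report mismatches against it.
scaleBy :: Double -> Double -> Double -> Double
scaleBy factor offset x = (x - offset) / factor

normalizeTop :: Double -> Double -> [Double] -> [Double]
normalizeTop factor offset = map (scaleBy factor offset)
```

Hoisting was trivial here because the helper captured only two variables; with half a dozen captured names and polymorphic types, the same move means reconstructing a signature the original author never wrote.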
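
As a sketch of what "include derivations" could look like for the rleDecode example above: the intermediate steps below are one plausible route to the short definition, assumed for illustration rather than taken from the original derivation, and the names rleDecodeSlow, rleDecodeStep1, rleDecodeStep2 and prop_decodeAgrees are hypothetical. The property is a plain Bool-valued function in the style of a QuickCheck property, so the block needs nothing outside base.

```haskell
-- Reference definition: long-hand recursion, kept because it is obviously
-- correct.
rleDecodeSlow :: [(Int, a)] -> [a]
rleDecodeSlow []              = []
rleDecodeSlow ((n, x) : rest) = replicate n x ++ rleDecodeSlow rest

-- Step 1: the explicit recursion is a concatMap over the pairs.
rleDecodeStep1 :: [(Int, a)] -> [a]
rleDecodeStep1 pairs = concatMap (\(n, x) -> replicate n x) pairs

-- Step 2: the lambda is just 'uncurry replicate'; drop the argument.
rleDecodeStep2 :: [(Int, a)] -> [a]
rleDecodeStep2 = concatMap (uncurry replicate)

-- Final form: in the list monad, (f =<<) behaves like concatMap f.
rleDecode :: [(Int, a)] -> [a]
rleDecode = (uncurry replicate =<<)

-- Every step must agree with the reference definition on the same input.
-- With QuickCheck installed, this could be passed to quickCheck unchanged.
prop_decodeAgrees :: [(Int, Char)] -> Bool
prop_decodeAgrees pairs =
  all (== rleDecodeSlow pairs)
      [rleDecodeStep1 pairs, rleDecodeStep2 pairs, rleDecode pairs]
```

In GHCi, rleDecode [(3,'a'),(1,'b')] evaluates to "aaab", and each intermediate step remains available as a fallback if the short form ever stops compiling.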

  1. Strictly speaking, Cabal was first proposed in 2003 and a paper published in 2005, but 2007 was when there was running code that could credibly be the sole method of installation, and not an experimental alternative which one might use in parallel with Autotools.

  2. Mark Pilgrim, “Freedom 0”; he is quoting/paraphrasing Matthew Thomas.

  3. For an explanation of this task, and how that monad stuff works, see