Golfing Run Length Encoding in Haskell

Haskell: step by step refactoring to concision
Haskell, tutorial
2008-09-262013-01-16 finished certainty: highly likely importance: 2

Once I was play­ing with and work­ing on a Haskell clone of the old arcade games, Mona­dius. Most of my changes were not par­tic­u­larly inter­est­ing (clean­ing up, Cabal­iz­ing, fix­ing warn­ings, switch­ing all Integer to Int and so on), but in its Demo.hs, I found an inter­est­ing solu­tion to a prob­lem, and it seems like a good small exam­ple of Haskell abstrac­tion.


One of the prob­lems with lan­guage advo­cacy is that it’s hard to have good exam­ples, since some­one who is writ­ing an essay prob­a­bly will only come up with triv­ial ones, and some­one deal­ing with meaty prob­lems isn’t think­ing about advo­cacy but how to solve their prob­lem. I hope this exam­ple will be a lit­tle more sub­stan­tial than the one-lin­ers exam­ples of or .

Level data in Monadius

Sup­pose we have lev­els which are spec­i­fied by a pair of num­bers and then a long list of num­bers, often very rep­e­ti­tious. Per­haps a par­tic­u­lar level might be rep­re­sented this way:

level1 :: ((Int, Int), [Int])
level1 = ((2,1),[0,0,0,0,0,0,0,0,0,0,0,0,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,

This is an ugly way to define a lev­el. We could just scrap this rep­re­sen­ta­tion as Int com­pletely and per­haps define it using a hypo­thet­i­cal as

level1 :: ((Geometry,Geometry),[Enemy])
level1 = ((Tall,Narrow), [Flyer,Flyer,Flyer,Flyer,Flyer,Flyer,Shooter,PowerUp,Boss..])

But this rep­re­sen­ta­tion, while cer­tainly more under­stand­able, is still very rep­e­ti­tious. It will take even more space to write down.

We could auto-derive the Read and Show type­classes for Geometry and Enemy datatypes, and then we could store level1, level2 etc. as some sort of “lev­elin­for­ma­tion.­dat” file, and read the level infor­ma­tion in at run­time.

But that makes instal­la­tion and run­ning more diffi­cult1 still quite pos­si­ble, , and it does­n’t address the core issue: the infor­ma­tion is writ­ten in a way that is so ungainly that it’s next to impos­si­ble to write or even mod­ify!

The solution

We need some way of express­ing this more con­cise­ly, of, in short, com­press­ing it. In this vein of thought, our first obser­va­tion should be that we do not need to resort to fancy com­pres­sion libraries or any­thing; there is an obvi­ous way to com­press it—the entire thing is a series of repeated num­bers.

We should be able to replace the lengthy enu­mer­a­tion of frag­ments like [0,0,0,0,0,0,0,0,0,0,0,0] with some­thing sim­pler. For exam­ple, the num­ber of rep­e­ti­tions, and what is to be repeat­ed: (12,0).

This is shorter and eas­ier to mod­i­fy. It is pos­si­ble that there may be a per­for­mance ben­e­fit here, as we’ve got­ten rid of a large con­stant that would have to be defined in the pro­gram itself and instead replaced it with a shorter func­tion which eval­u­ates to the same thing. (If we’re only on level 1, we don’t need to carry around the expanded ver­sion of the other lev­els, and when we go to level 2, level 1 will be garbage-col­lect­ed.)

Writing it

So, what is the type of our decom­press­ing func­tion? Well, we need to turn a (Int,Int) into a [Int]; even bet­ter, we want to turn a whole list of (Int,Int)s into a sin­gle list of Ints. Thus, our end goal is going to be a func­tion of this type:

rleDecode :: [(Int,Int)] -> [Int]

Let us tackle the sin­gle tuple exam­ple first. The sec­ond entry defines what we need, and the first entry defines how many we need, and that’s our type right there:

rleRecursiveDecode :: (Int, a) -> [a]

We could write a func­tion that takes the para­me­ter, decreases the counter by 1, and cons on a copy of the sec­ond entry. It could look like this:

rleRecursiveDecode (0,_) = []
rleRecursiveDecode (n,b) = b : (rleRecursiveDecode (n-1,b))

But this really is not the best way to go. It is not nec­es­sar­ily easy to fol­low, and if there is one thing I have learned about Haskell pro­gram­ming, it is that the most obvi­ous approach (in this case, prim­i­tive recur­sion) may not be the best way to go.

This is code that could have as well been writ­ten in Scheme or some­thing. It is com­pli­cated because we are try­ing to ensure that we gen­er­ate an item for the list only at the exact moment we need to add it into the list; we are pro­gram­ming as if our lan­guage is strict, in other words. So, what is the lazy way of doing things? Why do we like lazi­ness to begin with?

One of the key ratio­nale for is that it pro­motes —one func­tion does­n’t need to know when or how much has been done by another func­tion. In this case, what con­cerns ought to be sep­a­rat­ed? Well, our desired list has 2 prop­er­ties: its length, and the sin­gle repeated item.

Since we can cre­ate infi­nite lists, there is no need for the length to be set at the same time as the repeat­ed-item is being gen­er­at­ed. We can cre­ate an infi­nite list con­tain­ing the item, and we demand as many as we need. Lazy eval­u­a­tion means our infi­nite list does­n’t blow up. Sim­ple enough!

The gen­er­a­tion is too easy for words:

-- Define an infinite list of our item.
duplicate :: a -> [a]
duplicate a = a : duplicate a

But what’s our demand func­tion? A lit­tle think­ing and we know that we have a list, a num­ber, and we get back a list. Int -> [a] -> [a] describes a few func­tions, the sec­ond of which turns out to be what we want: take.

rleLazyDecode :: (Int,Int) -> [Int]
rleLazyDecode (n,b) = take n (duplicate b)

Now, duplicate is a sim­ple enough func­tion to define, but another type­-search will show that a -> [a] is already defined in the stan­dard libraries—repeat.

So we scrap duplicate:

rleLazyDecode (n,b) = take n (repeat b)

Might as well remove the paren­the­ses:

rleLazyDecode (n,b) = take n $ repeat b

We can do bet­ter. What is the type of \n b -> take n $ repeat b? It is Int -> a -> [a], and a Hoogle type­-search turns up an exact match: repli­cate. So we can improve it fur­ther:

rleLazyDecode (n,b) = replicate n b

Are we quite done? Well, one could ask: rleLazyDecade is now basi­cally noth­ing but an alias or new name for replicate, except for how it con­verts from a tuple (n,b) to nor­mal uncur­ried argu­ments n b. Is there any func­tion which will do this con­ver­sion for us? Well, we have a func­tion replicate :: a -> b -> c, and data (a,b) passed into said func­tion to pro­duce a c, so we want a func­tion with the type sig­na­ture (a -> b -> c) -> (a,b) -> c, which turns out to be some­thing called uncurry. Hav­ing rea­soned our way this far, the imple­men­ta­tion is much clearer than the type:

uncurry :: (a -> b -> c) -> (a, b) -> c
uncurry f (a,b) = f a b

That’s what we were just doing man­u­al­ly. So now the next ver­sion is clear:

rleLazyDecode = uncurry replicate

A sat­is­fy­ing, short, func­tion­al, and lazy one-lin­er. From here the defi­n­i­tion of rleDecode is almost triv­ial: we extend it to a list of tuples by throw­ing in a map, and we turn the result­ing list of lists into a sin­gle list by way of con­cat:

rleDecode ns = concat $ map rleLazyDecode ns

We can tweak this fur­ther, as concat . map is a com­mon enough idiom that there is a short­cut:

rleDecode ns = concatMap rleLazyDecode ns

As before, we could have found concatMap by search­ing for the type sig­na­ture of concat . map: (a1 -> [a]) -> [a1] -> [a]. Aw heck­—let’s make it like rleLazyDecode:

rleDecode = concatMap rleLazyDecode

And then we sub­sti­tute in the rleLazyDecode defi­n­i­tion:

rleDecode = concatMap (uncurry replicate)

We can actu­ally go even fur­ther into the realms of incom­pre­hen­si­bil­i­ty. It turns out that lists are a mon­ad. This means we can use bind and all the rest of the oper­a­tions defined by the Monad type­class to oper­ate on lists and other things. So we can write this bizarre (but short and effec­tive) ver­sion of rleDecode:

rleDecode = (uncurry replicate =<<)

And we are done! We can now rep­re­sent the first level like this:

d1 = ((2,1),d)
  where d = rleDecode [(5, 3), (10, 67), (6, 3), (5, 67), (29, 3), (2, 11),
     (29, 3), (5, 35), (3, 3), (3, 35), (24, 3), (2, 35), (19, 3), (7, 19), (21, 51),
     (63, 35), (15, 43), (5, 11), (8, 43), (8, 35), (3, 3), (9, 35), (20, 3), (52,
     67), (32, 3), (13, 35), (3, 3), (11, 35), (5, 3), (9, 35), (4, 3), (6, 35), (4,
     3), (5, 35), (13, 3), (14, 19), (4, 3), (4, 19), (6, 3), (4, 19), (7, 3), (5,
     19), (8, 3), (8, 19), (5, 83), (3, 67), (2, 3), (12, 19), (5, 83), (17, 19), (9,
     83), (2, 19), (5, 3), (20, 67), (1, 75), (38, 11), (1, 43), (4, 35), (7, 3),
     (57, 67), (2, 3), (5, 19), (17, 3), (6, 19), (8, 3), (6, 67), (7, 83), (59, 3),
     (9, 19), (1, 51), (17, 35), (20, 3), (14, 35), (7, 3), (2, 35), (11, 43), (6,
     11), (7, 3), (26, 67), (7, 3), (13, 35), (5, 43), (7, 11), (2, 3), (5, 67), (1,
     83), (2, 19), (2, 3), (10, 67), (21, 3), (147, 0)]

Much nicer! And since our list-pro­cess­ing code is , so even if we replace the arbi­trary num­bers by con­struc­tors, we change noth­ing.

(Thanks to the denizens of #haskell for refac­tor­ing tips, and Don Stew­art’s blog entry on run-length encoding/decoding using Arrows.)

  1. Although Cabal makes this quite easy with its Path­s_* mech­a­nism.↩︎