Golfing Run Length Encoding in Haskell

Haskell: step by step refactoring to concision
Haskell, tutorial
2008-09-262013-01-16 finished certainty: highly likely importance: 2

Once I was play­ing with and work­ing on a Haskell clone of the old ar­cade games, Mona­dius. Most of my changes were not par­tic­u­larly in­ter­est­ing (clean­ing up, Ca­bal­iz­ing, fix­ing warn­ings, switch­ing all Integer to Int and so on), but in its De­mo.hs, I found an in­ter­est­ing so­lu­tion to a prob­lem, and it seems like a good small ex­am­ple of Haskell ab­strac­tion.


One of the prob­lems with lan­guage ad­vo­cacy is that it’s hard to have good ex­am­ples, since some­one who is writ­ing an es­say prob­a­bly will only come up with triv­ial ones, and some­one deal­ing with meaty prob­lems is­n’t think­ing about ad­vo­cacy but how to solve their prob­lem. I hope this ex­am­ple will be a lit­tle more sub­stan­tial than the one-lin­ers ex­am­ples of or .

Level data in Monadius

Sup­pose we have lev­els which are spec­i­fied by a pair of num­bers and then a long list of num­bers, often very rep­e­ti­tious. Per­haps a par­tic­u­lar level might be rep­re­sented this way:

level1 :: ((Int, Int), [Int])
level1 = ((2,1),[0,0,0,0,0,0,0,0,0,0,0,0,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,

This is an ugly way to de­fine a lev­el. We could just scrap this rep­re­sen­ta­tion as Int com­pletely and per­haps de­fine it us­ing a hy­po­thet­i­cal as

level1 :: ((Geometry,Geometry),[Enemy])
level1 = ((Tall,Narrow), [Flyer,Flyer,Flyer,Flyer,Flyer,Flyer,Shooter,PowerUp,Boss..])

But this rep­re­sen­ta­tion, while cer­tainly more un­der­stand­able, is still very rep­e­ti­tious. It will take even more space to write down.

We could au­to-derive the Read and Show type­classes for Geometry and Enemy datatypes, and then we could store level1, level2 etc. as some sort of “lev­elin­for­ma­tion.­dat” file, and read the level in­for­ma­tion in at run­time.

But that makes in­stal­la­tion and run­ning more diffi­cult1 still quite pos­si­ble, , and it does­n’t ad­dress the core is­sue: the in­for­ma­tion is writ­ten in a way that is so un­gainly that it’s next to im­pos­si­ble to write or even mod­ify!

The solution

We need some way of ex­press­ing this more con­cise­ly, of, in short, com­press­ing it. In this vein of thought, our first ob­ser­va­tion should be that we do not need to re­sort to fancy com­pres­sion li­braries or any­thing; there is an ob­vi­ous way to com­press it—the en­tire thing is a se­ries of re­peated num­bers.

We should be able to re­place the lengthy enu­mer­a­tion of frag­ments like [0,0,0,0,0,0,0,0,0,0,0,0] with some­thing sim­pler. For ex­am­ple, the num­ber of rep­e­ti­tions, and what is to be re­peat­ed: (12,0).

This is shorter and eas­ier to mod­i­fy. It is pos­si­ble that there may be a per­for­mance ben­e­fit here, as we’ve got­ten rid of a large con­stant that would have to be de­fined in the pro­gram it­self and in­stead re­placed it with a shorter func­tion which eval­u­ates to the same thing. (If we’re only on level 1, we don’t need to carry around the ex­panded ver­sion of the other lev­els, and when we go to level 2, level 1 will be garbage-col­lect­ed.)

Writing it

So, what is the type of our de­com­press­ing func­tion? Well, we need to turn a (Int,Int) into a [Int]; even bet­ter, we want to turn a whole list of (Int,Int)s into a sin­gle list of Ints. Thus, our end goal is go­ing to be a func­tion of this type:

rleDecode :: [(Int,Int)] -> [Int]

Let us tackle the sin­gle tu­ple ex­am­ple first. The sec­ond en­try de­fines what we need, and the first en­try de­fines how many we need, and that’s our type right there:

rleRecursiveDecode :: (Int, a) -> [a]

We could write a func­tion that takes the pa­ra­me­ter, de­creases the counter by 1, and cons on a copy of the sec­ond en­try. It could look like this:

rleRecursiveDecode (0,_) = []
rleRecursiveDecode (n,b) = b : (rleRecursiveDecode (n-1,b))

But this re­ally is not the best way to go. It is not nec­es­sar­ily easy to fol­low, and if there is one thing I have learned about Haskell pro­gram­ming, it is that the most ob­vi­ous ap­proach (in this case, prim­i­tive re­cur­sion) may not be the best way to go.

This is code that could have as well been writ­ten in Scheme or some­thing. It is com­pli­cated be­cause we are try­ing to en­sure that we gen­er­ate an item for the list only at the ex­act mo­ment we need to add it into the list; we are pro­gram­ming as if our lan­guage is strict, in other words. So, what is the lazy way of do­ing things? Why do we like lazi­ness to be­gin with?

One of the key ra­tio­nale for is that it pro­motes —one func­tion does­n’t need to know when or how much has been done by an­other func­tion. In this case, what con­cerns ought to be sep­a­rat­ed? Well, our de­sired list has 2 prop­er­ties: its length, and the sin­gle re­peated item.

Since we can cre­ate in­fi­nite lists, there is no need for the length to be set at the same time as the re­peat­ed-item is be­ing gen­er­at­ed. We can cre­ate an in­fi­nite list con­tain­ing the item, and we de­mand as many as we need. Lazy eval­u­a­tion means our in­fi­nite list does­n’t blow up. Sim­ple enough!

The gen­er­a­tion is too easy for words:

-- Define an infinite list of our item.
duplicate :: a -> [a]
duplicate a = a : duplicate a

But what’s our de­mand func­tion? A lit­tle think­ing and we know that we have a list, a num­ber, and we get back a list. Int -> [a] -> [a] de­scribes a few func­tions, the sec­ond of which turns out to be what we want: take.

rleLazyDecode :: (Int,Int) -> [Int]
rleLazyDecode (n,b) = take n (duplicate b)

Now, duplicate is a sim­ple enough func­tion to de­fine, but an­other type­-search will show that a -> [a] is al­ready de­fined in the stan­dard li­braries—re­peat.

So we scrap duplicate:

rleLazyDecode (n,b) = take n (repeat b)

Might as well re­move the paren­the­ses:

rleLazyDecode (n,b) = take n $ repeat b

We can do bet­ter. What is the type of \n b -> take n $ repeat b? It is Int -> a -> [a], and a Hoogle type­-search turns up an ex­act match: repli­cate. So we can im­prove it fur­ther:

rleLazyDecode (n,b) = replicate n b

Are we quite done? Well, one could ask: rleLazyDecade is now ba­si­cally noth­ing but an alias or new name for replicate, ex­cept for how it con­verts from a tu­ple (n,b) to nor­mal un­cur­ried ar­gu­ments n b. Is there any func­tion which will do this con­ver­sion for us? Well, we have a func­tion replicate :: a -> b -> c, and data (a,b) passed into said func­tion to pro­duce a c, so we want a func­tion with the type sig­na­ture (a -> b -> c) -> (a,b) -> c, which turns out to be some­thing called un­curry. Hav­ing rea­soned our way this far, the im­ple­men­ta­tion is much clearer than the type:

uncurry :: (a -> b -> c) -> (a, b) -> c
uncurry f (a,b) = f a b

That’s what we were just do­ing man­u­al­ly. So now the next ver­sion is clear:

rleLazyDecode = uncurry replicate

A sat­is­fy­ing, short, func­tion­al, and lazy one-lin­er. From here the de­fi­n­i­tion of rleDecode is al­most triv­ial: we ex­tend it to a list of tu­ples by throw­ing in a map, and we turn the re­sult­ing list of lists into a sin­gle list by way of con­cat:

rleDecode ns = concat $ map rleLazyDecode ns

We can tweak this fur­ther, as concat . map is a com­mon enough id­iom that there is a short­cut:

rleDecode ns = concatMap rleLazyDecode ns

As be­fore, we could have found concatMap by search­ing for the type sig­na­ture of concat . map: (a1 -> [a]) -> [a1] -> [a]. Aw heck­—let’s make it like rleLazyDecode:

rleDecode = concatMap rleLazyDecode

And then we sub­sti­tute in the rleLazyDecode de­fi­n­i­tion:

rleDecode = concatMap (uncurry replicate)

We can ac­tu­ally go even fur­ther into the realms of in­com­pre­hen­si­bil­i­ty. It turns out that lists are a mon­ad. This means we can use bind and all the rest of the op­er­a­tions de­fined by the Monad type­class to op­er­ate on lists and other things. So we can write this bizarre (but short and effec­tive) ver­sion of rleDecode:

rleDecode = (uncurry replicate =<<)

And we are done! We can now rep­re­sent the first level like this:

d1 = ((2,1),d)
  where d = rleDecode [(5, 3), (10, 67), (6, 3), (5, 67), (29, 3), (2, 11),
     (29, 3), (5, 35), (3, 3), (3, 35), (24, 3), (2, 35), (19, 3), (7, 19), (21, 51),
     (63, 35), (15, 43), (5, 11), (8, 43), (8, 35), (3, 3), (9, 35), (20, 3), (52,
     67), (32, 3), (13, 35), (3, 3), (11, 35), (5, 3), (9, 35), (4, 3), (6, 35), (4,
     3), (5, 35), (13, 3), (14, 19), (4, 3), (4, 19), (6, 3), (4, 19), (7, 3), (5,
     19), (8, 3), (8, 19), (5, 83), (3, 67), (2, 3), (12, 19), (5, 83), (17, 19), (9,
     83), (2, 19), (5, 3), (20, 67), (1, 75), (38, 11), (1, 43), (4, 35), (7, 3),
     (57, 67), (2, 3), (5, 19), (17, 3), (6, 19), (8, 3), (6, 67), (7, 83), (59, 3),
     (9, 19), (1, 51), (17, 35), (20, 3), (14, 35), (7, 3), (2, 35), (11, 43), (6,
     11), (7, 3), (26, 67), (7, 3), (13, 35), (5, 43), (7, 11), (2, 3), (5, 67), (1,
     83), (2, 19), (2, 3), (10, 67), (21, 3), (147, 0)]

Much nicer! And since our list-pro­cess­ing code is , so even if we re­place the ar­bi­trary num­bers by con­struc­tors, we change noth­ing.

(Thanks to the denizens of #haskell for refac­tor­ing tips, and Don Stew­art’s blog en­try on run-length en­cod­ing/de­cod­ing us­ing Ar­rows.)

  1. Al­though Ca­bal makes this quite easy with its Path­s_* mech­a­nism.↩︎