haskell-enumerator: An implementation of Oleg Kiselyov’s left-fold enumerators

Code: https://john-millikin.com/code/haskell-enumerator

History

This library is unmaintained and obsolete, and is of historical interest only.

It was originally a simplified implementation of Oleg Kiselyov’s left-fold enumerators (hence the name). Even with a reworked interface and smaller scope, the conceptual model of iteratees and enumerators proved too complex for widespread adoption. Error handling was particularly hard to reason about.

I was never able to find a use case that enumerators handled better than plain imperative code, and eventually abandoned the whole idea.

The most notable user of enumerators was Michael Snoyman's http-enumerator, which was also abandoned due to complexity.

For those interested in more information, several articles and tutorials are available:

And I made an attempt at it too, in Understanding Iteratees.

Original Project Summary

Say you want to read in a huge file. You're going to calculate its checksum, or count how many newlines it has, or whatever. How do you do that when the file's larger than your machine's memory?

In most languages, the answer is "write a loop". You read the file in small chunks, then run those chunks through whatever processing you want to do. Each loop might do something different with the data, but they've all got the same boilerplate structure.

Haskell programmers noticed that if you squint a bit, files look like really long lists of bytes. Haskell already has tons of functions which work on lists, so all they needed to do to get easy file processing was trick the compiler. The trick these programmers used is called "lazy I/O".

It turns out that lazy I/O has a big downside: it makes thinking about the program's resource requirements very difficult. Servers based on lazy I/O tend to run out of file descriptors, or allocate huge amounts of memory, without any obvious way to fix them.

Another approach to the problem is to go back to the original buffer/loop design, and chop it up. Loops are split into a data source (or enumerator), a data sink (or iteratee), and intermediate data transformers (or enumeratees). These types are composable just like basic list functions, so it's easy to build up complex data processors from re-usable components.

Examples

Here's a quick example; we're going to count how many Unicode characters are in a UTF-8 file.

import Data.Enumerator as Enum import Data.Enumerator.Binary as Binary import Data.Enumerator.Text as Text import System.Environment (getArgs) main :: IO () main = do args <- getArgs let filename = case args of [] -> error "Need a file to read from!" (x:_) -> x -- 'enumFile' is an enumerator (data source), which opens a file and -- streams its contents. 'decode' is an enumeratee (data transformer), -- which converts bytes into Unicode text. let enumFileUtf8 = Binary.enumFile filename $= Text.decode utf8 -- 'fold' takes an update function and initial state, then runs for -- each character in the stream. Here we start with 0, then increment -- by 1 for each character. let countChars = Text.fold (\n _ -> n + 1) 0 -- Up until now, we've just defined stuff. 'run' causes the pipeline -- to execute, and returns whatever the final iteratee yielded. count <- Enum.run_ (enumFileUtf8 $$ countChars) print count

(Old Article) Understanding Iteratees

Iteratees are an abstraction discovered by Oleg Kiselyov, which provide a performant, predictable, and safe alternative to lazy I/O. Though the data types involved are simple, their relationship to incremental processing is not obvious, and existing documentation ranges in quality from merely dense to outright baffling. This article attempts to clarify the concepts and use underlying iteratees.

Please note that these are my notes, as I attempt to implement iteratee–based libraries. I may have misunderstood minor or major parts of iteratees. If in doubt, the final authority is Oleg -- though understanding his answers requires a saving throw vs. confusion. Please e-mail me any comments or suggestions.

2010–08–19: the code available in this article has been expanded and packaged as the enumerator library.

Iteratees vs. Lazy I/O

Lazy I/O – eg, hGetContents and friends – is known to have several shortcomings. Most notably, IO errors can occur in pure code and Handles may remain open for arbitrary periods of time. Oleg notes[1] that this can lead to unexpected failures, due to resource exhaustion.

Iteratees do not suffer from these problems. Their resource use is bounded and predictable, and the type system provides guarantees that limited resources are released when no longer needed. Notably, iteratees can process arbitrarily large inputs in constant space.

Implementing iteratees

There are at least five generic iteratee libraries, each with differing type signatures and semantics. Oleg's Iteratee.hs, IterateeM.hs, & IterateeMCPS.hs, John Lato's iteratee package, and a post by Per Magnus Therning.

This page documents a sixth implementation, based on IterateeM, with simplified error handling and naming conventions (hopefully) more obvious to the average Haskell programmer.

data Chunk a = Chunk [a] | EOF deriving (Show, Eq) data Step e a m b = Continue (Chunk a -> Iteratee e a m b) | Yield b (Chunk a) | Error e newtype Iteratee e a m b = Iteratee { runIteratee :: m (Step e a m b) }

In general, an iteratee begins in the Continue state. As each chunk is passed to the continuation, the iteratee may return the next step, which is one of:

Continue
The iteratee requires more input before it can produce a result.
Yield
The iteratee has received enough input to generate a result, along with left–over input. If the iteratee will no longer accept input, it should yield EOF. If no input remains, but the iteratee can still accept more, it should yield Chunk [].
Error
The iteratee experienced an error which prevents it from proceeding further. The type of error contained will depend on the enumerator and/or iteratee – common choices are String and SomeException.

Based on these semantics, some simple instances can be created:

instance Monoid (Chunk a) where mempty = Chunk [] mappend (Chunk xs) (Chunk ys) = Chunk $ xs ++ ys mappend _ _ = EOF instance Functor Chunk where fmap _ EOF = EOF fmap f (Chunk xs) = Chunk $ map f xs instance (Show a, Show b, Show e) => Show (Step e a m b) where showsPrec d step = showParen (d > 10) $ case step of (Continue _) -> s "Continue" (Yield b chunk) -> s "Yield " . sp b . s " " . sp chunk (Error err) -> s "Error " . sp err where s = showString sp :: Show a => a -> ShowS sp = showsPrec 11

Slightly more complex is the Monad instance for iteratees. The first iteratee is run, and if it yielded a value, that value is fed into the second iteratee.

instance Monad m => Monad (Iteratee e a m) where return x = Iteratee . return . Yield x $ Chunk [] m >>= f = Iteratee $ runIteratee m >>= \mStep -> case mStep of Continue k -> return $ Continue ((>>= f) . k) Error err -> return $ Error err Yield x (Chunk []) -> runIteratee $ f x Yield x chunk -> runIteratee (f x) >>= \r -> case r of Continue k -> runIteratee $ k chunk Error err -> return $ Error err -- runIteratee (f x) does not consume any input; if it -- returns Yield, then its "extra" input must be -- (Chunk []) and can be ignored. Yield x' _ -> return $ Yield x' chunk instance MonadTrans (Iteratee e a) where lift m = Iteratee $ m >>= runIteratee . return instance MonadIO m => MonadIO (Iteratee e a m) where liftIO = lift . liftIO instance Monad m => Functor (Iteratee e a m) where fmap f i = i >>= return . f

Next, lets define a few simple primitive combinators for building iteratees from pure functions:

returnI :: Monad m => Step e a m b -> Iteratee e a m b returnI = Iteratee . return liftI :: Monad m => (Chunk a -> Step e a m b) -> Iteratee e a m b liftI k = returnI $ Continue (returnI . k) yield :: Monad m => b -> Chunk a -> Iteratee e a m b yield x chunk = returnI $ Yield x chunk continue :: Monad m => (Chunk a -> Iteratee e a m b) -> Iteratee e a m b continue k = returnI $ Continue k throwError :: Monad m => e -> Iteratee e a m b throwError err = returnI $ Error err

These combinators are sufficient to define simple iteratees; for example, a variation of dropWhile:

-- import Prelude hiding (dropWhile) -- import qualified Prelude as Prelude dropWhile :: Monad m => (a -> Bool) -> Iteratee e a m () dropWhile f = liftI step where step (Chunk xs) = case Prelude.dropWhile f xs of [] -> Continue $ returnI . step xs' -> Yield () (Chunk xs') step EOF = Yield () EOF

Or an iteratee for printing received chunks to stdout, useful for debugging:

printChunks :: MonadIO m => Show a => Bool -> Iteratee e a m () printChunks printEmpty = continue step where step (Chunk []) | not printEmpty = continue step step (Chunk xs) = liftIO (print xs) >> continue step step EOF = liftIO (putStrLn "EOF") >> yield () EOF

Finally, to extract the final result from an iteratee, it's sufficient to feed it EOF and check the returned Step. Note that a "well–behaved" iteratee continuation will always return Yield or Error in response to EOF – iteratees which return Continue may loop forever, depending on their monadic behavior.

run :: Monad m => Iteratee e a m b -> m (Either e b) run i = runIteratee i >>= check where check (Continue k) = runIteratee (k EOF) >>= check check (Yield x _) = return $ Right x check (Error e) = return $ Left e

Enumerators

Iteratees consume data from a sequence of input chunks. To generate those chunks, we define enumerators (and enumerator composition operators).

type Enumerator e a m b = Step e a m b -> Iteratee e a m b infixl 1 >>==, ==<< (>>==) :: Monad m => Iteratee e a m b -> (Step e a m b -> Iteratee e a' m b') -> Iteratee e a' m b' m >>== f = Iteratee (runIteratee m >>= runIteratee . f) (==<<):: Monad m => (Step e a m b -> Iteratee e a' m b') -> Iteratee e a m b -> Iteratee e a' m b' f ==<< m = m >>== f

Note that the Enumerator type is semantically equivalent to:

type Enumerator e a m b = Step e a m b -> m (Step e a m b)

Simple enumerators can be defined in terms of existing combinators. The basic format of an enumerator is that when it receives a Continue step, it passes a chunk to the continuation to generate its returned iteratee. Other step types are passed through unchanged.

enumList :: Monad m => [a] -> Enumerator e a m b enumList xs (Continue k) = case xs of [] -> k EOF (x:xs') -> k (Chunk [x]) >>== enumList xs' enumList _ step = returnI step

More complex enumerators require building the result manually. Note that while the recursive step is much larger in this example, the fundamental layout (loop on Continue, pass on others) remains.

enumHandle :: Handle -> Enumerator String ByteString IO b enumHandle h = Iteratee . allocaBytes bufferSize . loop where bufferSize = 4096 loop (Continue k) = do_read k loop step = const $ return step do_read k p = do n <- try $ hGetBuf h p bufferSize case (n :: Either SomeException Int) of Left err -> return $ Error $ show err Right 0 -> return $ Continue k Right n' -> do bytes <- packCStringLen (p, n') step <- runIteratee (k (Chunk [bytes])) loop step p

In some cases, it might make more sense to define this enumerator in terms of bytes rather than byte strings. The required changes are minor – the bytes are stored directly in the Chunk list.

enumHandle :: Handle -> Enumerator String Word8 IO bRight n' -> do bytes <- F.peekArray n' p step <- runIteratee (k (Chunk bytes)) loop step p

Enumeratees

Enumerators generate data, iteratees consume it. When a value needs to generate a stream using another stream as input, it is named an enumeratee.

type Enumeratee e aOut aIn m b = Step e aIn m b -> Iteratee e aOut m (Step e aIn m b)

Most interesting transformations in iteratee-based code are enumeratees. For example, map can be encoded as an enumeratee:

checkDone :: Monad m => ((Chunk a -> Iteratee e a m b) -> Iteratee e a' m (Step e a m b)) -> Enumeratee e a' a m b checkDone _ (Yield x chunk) = return $ Yield x chunk checkDone f (Continue k) = f k checkDone _ (Error err) = throwError err mapI :: Monad m => (ao -> ai) -> Enumeratee e ao ai m b mapI f = checkDone $ continue . step where step k EOF = yield (Continue k) EOF step k (Chunk []) = continue $ step k step k chunk = k (fmap f chunk) >>== mapI f

A more complex example: sequenceI converts an iteratee to an enumeratee, by feeding it input until it returns EOF. This is useful for chaining iteratees together, to support embedded streams.

finished :: Monad m => Iteratee e a m Bool finished = liftI $ \chunk -> case chunk of EOF -> Yield True EOF _ -> Yield False chunk sequenceI :: Monad m => Iteratee e ao m ai -> Enumeratee e ao ai m b sequenceI i = checkDone check where check k = finished >>= \f -> if f then yield (Continue k) EOF else step k step k = i >>= \v -> k (Chunk [v]) >>== sequenceI i

A join combinator is useful for "extracting" an output stream from an enumeratee's result.

joinI :: Monad m => Iteratee e a m (Step e a' m b) -> Iteratee e a m b joinI outer = outer >>= check where check (Continue k) = k EOF >>== check check (Yield x _) = return x check (Error e) = throwError e

  1. Oleg Kiselyov – Lazy vs correct IO