It's been over two years since the last major release of ekg. Ever since the first release I knew there were a number of features I wanted ekg to have but didn't implement back then. This release adds most of them.
Integration with other monitoring systems
When I first wrote ekg I knew it only solved half of the program monitoring problem. Good monitoring requires two things:

1. a way to track what your program is doing, and
2. a way to gather and persist that data in a central location.
The latter is necessary because
- you don't want to lose your data if your program crashes (i.e. ekg only stores metrics in memory),
- you want to get an aggregate picture of your whole system over time, and
- you want to define alarms that go off if some metric passes some threshold.
Ekg has always done (1), as it provides a way to define metrics and inspect their values e.g. using your web browser or curl.
Ekg could help you to do (2), as you could use the JSON API to sample metrics and then push them to an existing monitoring solution, such as Graphite or Ganglia. However, it was never really convenient.
With this release, (2) gets much easier.
Statsd integration
Statsd is
A network daemon that ... listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
Statsd is quite popular and has both client and server implementations in multiple languages. It supports quite a few backends, such as Graphite, Ganglia, and a number of hosted monitoring services. It's also quite easy to install and configure (although many of the backends it supports are not).
Ekg can now be integrated with statsd, using the ekg-statsd package. With a few lines you can have your metrics sent to a statsd:
import System.Metrics (newStore, registerGcMetrics)
import System.Remote.Monitoring.Statsd (defaultStatsdOptions, forkStatsd)

main = do
    store <- newStore
    -- Register some metrics with the metric store:
    registerGcMetrics store
    -- Periodically flush metrics to statsd:
    forkStatsd defaultStatsdOptions store
ekg-statsd can be used either together with ekg, if you also want the web interface, or standalone, if the dependencies pulled in by ekg are too heavyweight for your application or if you don't care about the web interface. ekg has been extended so that it can share the Server's metric store with other parts of the application:
import System.Remote.Monitoring (forkServer, serverMetricStore)
import System.Remote.Monitoring.Statsd (defaultStatsdOptions, forkStatsd)

main = do
    handle <- forkServer "localhost" 8000
    forkStatsd defaultStatsdOptions (serverMetricStore handle)
Once you have set up statsd and a backend such as Graphite, the lines above are enough to make your metrics show up in Graphite.
Integration with your monitoring systems
The ekg APIs have been re-organized and the package split such that it's much easier to write your own package to integrate with the monitoring system of your choice. The core API for tracking metrics has been split out from the ekg package into a new ekg-core package. Using this package, the ekg-statsd implementation could be written in a mere 121 lines.
While integrating with other systems was technically possible in the past, using the ekg JSON API, it was both inconvenient and wasted CPU cycles generating and parsing JSON. Now you can get an in-memory representation of the metrics at a given point in time using the System.Metrics.sampleAll function:
-- | Sample all metrics. Sampling is /not/ atomic in the sense that
-- some metrics might have been mutated before they're sampled but
-- after some other metrics have already been sampled.
sampleAll :: Store -> IO Sample
-- | A sample of some metrics.
type Sample = HashMap Text Value
-- | The value of a sampled metric.
data Value = Counter !Int64
           | Gauge !Int64
           | Label !Text
           | Distribution !Stats
All that ekg-statsd does is call sampleAll periodically and convert the returned Values to UDP packets that it sends to statsd.
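The conversion itself is straightforward. Here is a minimal sketch, not ekg-statsd's actual code: it uses a simplified stand-in for ekg-core's Value type and one plausible mapping to statsd's "name:value|type" wire format:

```haskell
import Data.Int (Int64)

-- Simplified stand-in for ekg-core's Value type (Distribution omitted).
data Value = Counter !Int64
           | Gauge !Int64
           | Label !String

-- One plausible mapping to statsd's "<name>:<value>|<type>" wire format.
toStatsdLine :: String -> Value -> Maybe String
toStatsdLine name (Counter n) = Just (name ++ ":" ++ show n ++ "|c")
toStatsdLine name (Gauge n)   = Just (name ++ ":" ++ show n ++ "|g")
toStatsdLine _    (Label _)   = Nothing  -- statsd has no label type
```

A flush loop then formats every entry of the Sample this way and writes the resulting lines to a UDP socket.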
Namespaced metrics
In a large system each component may want to contribute their own metrics to the set of metrics exposed by the program. For example, the Snap web server might want to track the number of requests served, the latency for each request, the number of requests that caused an internal server error, etc. To allow several components to register their own metrics without name clashes, ekg now supports namespaces.
Namespaces also make it easier to navigate metrics in UIs. For example, Graphite gives you tree-like navigation of metrics based on their namespaces.
In ekg, dots in metric names are now interpreted as namespace separators. For example, the default GC metric names now all start with "rts.gc.". Snap could, for example, prefix all its metric names with "snap.". While this doesn't make collisions impossible, it should make them much less likely.
If your library wants to provide a set of metrics for the application, it should provide a function that looks like this:
registerFooMetrics :: Store -> IO ()
The function should call the various register functions in System.Metrics. It should also document which metrics it registers. See System.Metrics.registerGcMetrics for an example.
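For instance, a hypothetical library foo (both the library and the metric names below are made up for illustration) might expose something like this, registering everything under the "foo." namespace. A real library would keep the returned handles around so it can update the metrics later:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import System.Metrics (Store, createCounter, createDistribution)

-- | Register foo's metrics with the given metric store.
--
-- Registered metrics:
--
--   [@foo.requests@] Number of requests served (counter).
--   [@foo.request_latency_ms@] Request latency in ms (distribution).
registerFooMetrics :: Store -> IO ()
registerFooMetrics store = do
    _requests <- createCounter "foo.requests" store
    _latency  <- createDistribution "foo.request_latency_ms" store
    return ()
```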
A new metric type for tracking distributions
It's often desirable to track the distribution of some event. For example, you might want to track the distribution of response times for your webapp, so you can get notified if things are slow all of a sudden and so you can try to optimize the latency.
The new Distribution metric lets you do that. Every time an event occurs, simply call the add function:
add :: Distribution -> Double -> IO ()
The add function takes a value which could represent e.g. the number of milliseconds it took to serve a request.
When the distribution metric is later sampled, you're given a value that summarizes the distribution by providing you with the mean, variance, min/max, and so on.
The implementation uses an online algorithm to track these statistics so it uses O(1) memory. The algorithm is also numerically stable so the statistics should be accurate even for long-running programs.
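The standard technique here is Welford's online method (ekg's actual implementation may differ in details); a self-contained sketch of the idea:

```haskell
-- Welford's online algorithm: one pass over the events,
-- O(1) state, numerically stable.
data Acc = Acc
    { accCount :: !Int     -- number of events seen
    , accMean  :: !Double  -- running mean
    , accM2    :: !Double  -- sum of squared deviations from the mean
    }

-- Fold in one new event.
step :: Acc -> Double -> Acc
step (Acc n mu m2) x =
    let n'  = n + 1
        d   = x - mu
        mu' = mu + d / fromIntegral n'
    in Acc n' mu' (m2 + d * (x - mu'))

-- Mean and population variance of a stream of events.
summarize :: [Double] -> (Double, Double)
summarize xs =
    let Acc n mu m2 = foldl step (Acc 0 0 0) xs
    in (mu, if n > 0 then m2 / fromIntegral n else 0)
```

Because only the count, mean, and sum of squared deviations are kept, memory use is constant no matter how many events are added.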
While it didn't make this release, in the future you can look forward to tracking quantiles and keeping histograms of the events. This will let you track e.g. the 95th-percentile response time of your webapp.
Counters and gauges are always 64-bits
To keep ekg efficient even on 32-bit platforms, counters and gauges were previously stored as Int values. However, a counter that is increased 10,000 times per second, which isn't unusual for a busy server, would wrap around in less than 2.5 days on a 32-bit system. Therefore all counters and gauges are now stored as 64-bit values. While this is technically a breaking change, it shouldn't affect the majority of users.
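A quick sanity check of that claim (my arithmetic, not from the release notes):

```haskell
import Data.Int (Int32)

-- A signed 32-bit counter incremented 10,000 times per second wraps
-- after maxBound / rate seconds; divide by 86,400 to get days.
daysToWrap :: Double
daysToWrap = fromIntegral (maxBound :: Int32) / 10000 / 86400
-- ≈ 2.49 days
```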
I received a report of contention in ekg when multiple cores were used. This prompted me to improve the scaling of all metrics types. The difference is quite dramatic on my heavy contention benchmark:
         +RTS -N1   +RTS -N6
Before     1.998s    82.565s
After      0.117s     0.247s
The benchmark updates a single counter concurrently from 100 threads, performing 100,000 increments per thread. It was run on a 6-core machine. The cause of the contention was atomicModifyIORef, which has been replaced by an atomic-increment instruction. There are some details on the GHC Trac.
In short, you shouldn't see contention issues anymore. If you do, I still have some optimizations in reserve that I didn't apply, because the implementation should already be fast enough.
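For reference, the shape of such a contention benchmark (scaled down, and my sketch rather than the actual benchmark code), using the old atomicModifyIORef-style update that caused the contention:

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- N threads each bump a shared IORef counter M times: the access
-- pattern that contended badly before the atomic-increment change.
contend :: Int -> Int -> IO Int
contend nThreads nIncrs = do
    ref  <- newIORef (0 :: Int)
    done <- newEmptyMVar
    replicateM_ nThreads $ forkIO $ do
        replicateM_ nIncrs (atomicModifyIORef' ref (\n -> (n + 1, ())))
        putMVar done ()
    -- Wait for every thread to finish before reading the total.
    replicateM_ nThreads (takeMVar done)
    readIORef ref
```

Compiled with -threaded and run with +RTS -N6, timing this loop against the new counter implementation is essentially what the table above measures.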