Hackage at Baltimore

After the ICFP 2010 conference proper had come to a close, I came along for some Haskell-related revelry in Baltimore, from Thursday evening (Sept 30) to Sunday night (Oct 3). My paper-reading queue is chock full, and my wallet less so. Anyway, I had a good time and met some amazing people. I was there for a good reason, though.

I gave a talk at the Haskell Implementors’ Workshop about Hackage, which you can find at http://vimeo.com/15464003. It’s 35 minutes in total.

The presentation part is a straightforward overview. Open discussion starts at about 16:30. You can get the slides here [PDF].

I hope it gives you a better idea of where Hackage is going. During the weekend, I had some great discussions about Hackage, about comparing functional languages, and about finance (of all things). Even better, there was actual solid planning. On Sunday, Duncan Coutts and I materialized a plan for switching over to the new Hackage. It’s up online at the Hackage trac wiki, and the current revision looks something like this:

*Live mirroring (user-immutable, all accounts are historical)
- Get archive.tar.gz of all ~10,000 packages on Hackage
- Investigate unmirrorable packages (e.g. binembed-example, network-info, old-time)
- Get cabal-installs pointing at it
Implement backup for newer features (not all essential):
- Download statistics
- Candidates
- Preferred versions + deprecation
Get data migration (schema updates) working more smoothly
*Live server beta testing (user-mutable, all accounts are active)
- Disable registration; main Hackage accounts imported in
- Still mirroring the main Hackage
- Changes made here will be wiped out when server is fully deployed
Configure server with Apache to support the tracs, support https on Hackage
When ready to deploy: turn off upload on current Hackage
Construct export tarball with these features:
- core (packages, user db, admin list)
- upload (trustees, maintainers)
- tags (based on categories, initially)
- distro (from current files: arch + debian, eventually exherbo + ubuntu)
- download (from logs, give expected format to Galois log holders)
- versions (deprecated packages, preferred-versions)
Wipe server state and restore from tarball
*Switch!

Throughout all this there will be testing for backups and performance. The starred items are the significant ones that’ll be announced. They look like “use it with cabal-install!”, “use it as you please unofficially!”, and “use it as you please officially!”. If you’d like to learn more about some of the ideas behind hackage-server, the architecture document is a good starting point, as well as past blog posts and the features themselves.

Policy on the New Hackage Server

I’ve been working on the newer hackage-server as part of Google Summer of Code. It has user accounts, editable access control lists (user groups), and a system to hook in any number of pre-upload checks. It can utilize all of these to set the policy for how it filters its data. So how does it do that presently?

User groups

There are three important user groups: admins, package trustees, and package maintainers. Some server updates require membership in these groups; membership can be edited with a simple interface.

Admins perform administrative tasks. They can create accounts, change anyone’s password, delete an account, make server backups, and modify the members of the other user groups. They can also modify the package index in ways not allowed by normal uploads.
There is one package maintainer group per package. When a package is uploaded and no versions of it existed previously, the maintainer group is created with the uploader as the sole member. Maintainers can add other maintainers. Members of this group can then upload new versions of the package, edit its preferred versions and deprecated status, upload documentation, manage build reports, and other maintenance tasks. Of course, they don’t have to.
Package trustees are package maintainers for all packages. They can add and remove maintainers for any package, and perform any action per package that maintainers can.
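
As a rough illustration of how these three groups relate, here is a minimal sketch in Haskell. Every name in it (Groups, canMaintain, addMaintainer, registerFirstUpload) is hypothetical and chosen for this example; it is not hackage-server's actual data model, only one way the rules above could be written down.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Hypothetical types for illustration; not hackage-server's real ones.
type UserId = Int
type PackageName = String

data Groups = Groups
  { admins      :: Set.Set UserId
  , trustees    :: Set.Set UserId
  , maintainers :: Map.Map PackageName (Set.Set UserId)  -- one group per package
  }

-- Members of a package's maintainer group (empty if the package is unknown).
maintainersOf :: PackageName -> Groups -> Set.Set UserId
maintainersOf pkg = Map.findWithDefault Set.empty pkg . maintainers

-- Trustees can do anything a maintainer can, for every package.
canMaintain :: UserId -> PackageName -> Groups -> Bool
canMaintain uid pkg gs =
  uid `Set.member` trustees gs || uid `Set.member` maintainersOf pkg gs

-- Maintainers can add other maintainers; trustees can do so for any package.
addMaintainer :: UserId -> UserId -> PackageName -> Groups -> Maybe Groups
addMaintainer actor new pkg gs
  | canMaintain actor pkg gs =
      Just gs { maintainers =
                  Map.insertWith Set.union pkg (Set.singleton new) (maintainers gs) }
  | otherwise = Nothing

-- The first upload of a package creates its group with the uploader as sole member.
registerFirstUpload :: UserId -> PackageName -> Groups -> Groups
registerFirstUpload uid pkg gs
  | Map.member pkg (maintainers gs) = gs
  | otherwise =
      gs { maintainers = Map.insert pkg (Set.singleton uid) (maintainers gs) }
```

In this sketch the admin set is carried along but not consulted by canMaintain, since admins act on the groups themselves rather than on individual packages.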

It’s not set in stone, or even etched on papyrus, who the admins and trustees are actually going to be. Initially, package maintainers will be anyone who’s uploaded a version of a given package.

Other features provide their own user groups as well. One thing about their implementation is that they are entirely decentralized: there’s no section in the code which lists all of the user groups. There is a user → group mapping, but it’s updated only in response to the groups themselves being modified. Other groups, editable by admins, include:

Distro maintainers: can indicate which packages are available under which Linux distributions in their binary repositories. This information is available on package pages for those who prefer distro packages, as well as in list form.
Mirrorers: these are accounts for scripts which copy packages from one Hackage to another. Presently this is implemented in a batch-difference mode from hackage-scripts to hackage-server and is run periodically.
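
The decentralized user → group mapping described above can be pictured as a reverse index that each group updates when its own membership changes. The following is only a sketch under that assumption; GroupName, UserIndex, groupAdded, groupRemoved and groupsOf are made-up names, and the group keys are invented strings.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type UserId = Int
type GroupName = String   -- hypothetical key, e.g. "distro/arch"

-- Reverse index: which groups a user currently belongs to.
type UserIndex = Map.Map UserId (Set.Set GroupName)

-- Called by a group when it gains a member; nothing here enumerates all groups.
groupAdded :: GroupName -> UserId -> UserIndex -> UserIndex
groupAdded g uid = Map.insertWith Set.union uid (Set.singleton g)

-- Called by a group when it loses a member; users with no groups are dropped.
groupRemoved :: GroupName -> UserId -> UserIndex -> UserIndex
groupRemoved g uid = Map.update dropGroup uid
  where
    dropGroup gs =
      let gs' = Set.delete g gs
      in if Set.null gs' then Nothing else Just gs'

-- Look up every group a user is in, without consulting the groups themselves.
groupsOf :: UserId -> UserIndex -> Set.Set GroupName
groupsOf = Map.findWithDefault Set.empty
```

The point of the design is that a new feature can bring its own group and keep the index current, with no central list of groups to edit.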

Uploading

Uploading follows these steps:

Uploader POSTs to /packages/ with package=[package tarball]
Make sure the user is logged in and get their user info
Put the package file in a temporary directory for incoming files
Get the package’s cabal file, parse it, and check it’s valid. Get the package name and version.
Fail if the package version is already in the main database
If maintainers exist for the package, make sure the user is in the maintainer group
Run pre-upload hooks; these can indicate errors and cause the upload to fail
Move the package to blob file storage and add it to the main index
Run all of the post-upload hooks, updating secondary indices to keep them in sync
Redirect to the new package page (or display an error)
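
The checks in these steps can be sketched as a chain in the Either monad, where the first failure aborts the upload. This is an illustration only: ServerState, UploadError, PreUploadHook and upload are hypothetical names, the parsed cabal file is passed in as an already-computed Either, and file handling (the temporary directory and blob storage) is left out.

```haskell
import Control.Monad (when)
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Hypothetical types standing in for the real server state.
type UserId = Int
type PackageId = (String, String)          -- (name, version)

data UploadError
  = NotLoggedIn
  | InvalidCabalFile String
  | VersionExists PackageId
  | NotMaintainer String
  | HookRejected String
  deriving (Eq, Show)

data ServerState = ServerState
  { index       :: Set.Set PackageId
  , maintainers :: Map.Map String (Set.Set UserId)
  }

-- A pre-upload hook inspects the package and may veto the upload.
type PreUploadHook = PackageId -> Either UploadError ()

upload :: [PreUploadHook] -> Maybe UserId -> Either String PackageId
       -> ServerState -> Either UploadError ServerState
upload hooks mUser parsed st = do
  uid <- maybe (Left NotLoggedIn) Right mUser
  pkg <- either (Left . InvalidCabalFile) Right parsed
  let name = fst pkg
  -- Fail if this exact version is already in the main index.
  when (pkg `Set.member` index st) $ Left (VersionExists pkg)
  -- If a maintainer group exists, the uploader must be in it.
  case Map.lookup name (maintainers st) of
    Just ms | uid `Set.notMember` ms -> Left (NotMaintainer name)
    _ -> Right ()
  -- Any hook can reject the upload.
  mapM_ ($ pkg) hooks
  -- Add to the index; a first upload creates the maintainer group.
  let owners = Map.insertWith (\_ old -> old) name (Set.singleton uid) (maintainers st)
  Right st { index = Set.insert pkg (index st), maintainers = owners }
```

Post-upload hooks would run on the resulting state; they are omitted here because they update secondary indices rather than decide whether the upload succeeds.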

Account registration

I wish I could give you the process for account registration, but the truth is that it’s still undecided. The present system involves requesting an account via email. This could still work with the new hackage-server, technically. There are a few reasons why this kind of process could be refined: there can be several admins; account creation no longer requires access to the server’s filesystem (.htpasswd); and account maintenance of all of Hackage is a lot for one person.

Possible approaches include:

Admins create accounts, possibly requested from some kind of web interface.
Anyone can self-register partial accounts which can do everything but upload, but can e.g. edit tags, write comments, or vote. These can be transformed from partial to total accounts by admins (perhaps also using a ticket system).
Let anyone self-register for an account and start uploading (it’s worked for rubygems.org).

So…

The newer hackage-server comes with some nifty defaults, including more detailed ways to maintain packages. There are some guiding principles to consider when making policies: for example, packages shouldn’t be any harder to upload than they are now (which is not very). Another principle is quality assurance (see “A radical Hackage social experiment”). The above system has been developed with these in mind. Last but not least, there’s the community’s experience with current Hackage policies. How can they be improved?

August 14, 2010. Uncategorized. 6 comments.

Hackage on Sparky

Hi, Haskellers. It’s been a while since I finished most of Hackage 2.0’s internal infrastructure. The site still needs a visual makeover, but I feel that enough of the core functionality is exposed for it to be useful to you guys. The latest from the darcs repository is running at:

http://sparky.haskell.org:8080/

This is imported from Hackage package data a day or so ago—no user account data. The features currently enabled on the server are package pages, uploading packages, uploading candidates, distribution information, user groups, documentation, build reports, preferred versions, package deprecation, reverse dependencies, download statistics, tags, name search, and a handful of others.

The most important feature, though, and the reason this was a complete rewrite instead of just extending the old server, is that the internal design is modular and meant to be extended easily. If there’s a feature you don’t like (say, doing download statistics), it should take very little time to gut it from the application and not compile it in at all. The NameSearch module, as an example, adds two search indices, a simple search page (at /packages/find), and an OpenSearch plugin with suggestions. Installing it entails adding a line to Features.hs and writing an HTML view for it.
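
To give a feel for what "adding a line" could mean, here is a toy sketch of a feature list. It is not the real Features.hs: the Feature record, nameSearch, enabledFeatures and dispatch are all invented for this example, handlers are plain string functions, and the search is a naive substring match rather than a proper index.

```haskell
import Data.List (isInfixOf)

-- Hypothetical shape of a feature: a name plus the routes it contributes.
data Feature = Feature
  { featureName :: String
  , routes      :: [(String, String -> String)]  -- (URI path, query -> response body)
  }

-- A toy name search over a fixed package list.
nameSearch :: [String] -> Feature
nameSearch pkgs = Feature
  { featureName = "name-search"
  , routes = [("/packages/find", \q -> unlines (filter (isInfixOf q) pkgs))]
  }

-- Enabling or removing a feature is a one-line change to this list.
enabledFeatures :: [String] -> [Feature]
enabledFeatures pkgs =
  [ nameSearch pkgs
  -- A download statistics feature would be one more entry here; leaving it
  -- out means none of its routes exist.
  ]

-- Route a request by looking the path up among all enabled features.
dispatch :: [Feature] -> String -> String -> Maybe String
dispatch fs path query =
  fmap ($ query) (lookup path (concatMap routes fs))
```

With this shape, a request for a path owned by a disabled feature simply finds no route, which mirrors the idea of not compiling the feature in at all.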

Performance

As far as performance goes: the process of routing a URI, querying data from several sources, and rendering the resultant page takes anywhere from 15ms (for an unadorned package page) to 3 seconds (for long lists of packages with descriptions and tags) on the sparky server. This is the amount of time it takes to fully generate the document as a ByteString, which is then given to the Happstack web framework. Here are some example times. I expect that switching from xhtml to BlazeHTML, based on the benchmarks so far, would definitely reduce the rendering time; I’m looking into other places to cut corners, though I’m no expert here.
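
For readers who want to take similar measurements, one way to time "fully generating the document as a ByteString" is to force the whole lazy ByteString between two clock readings. This is a sketch of that idea, not the code used for the numbers above; timeRender is a hypothetical helper, and wall-clock time from getCurrentTime is assumed to be precise enough.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Time.Clock (NominalDiffTime, diffUTCTime, getCurrentTime)

-- Force an entire lazy ByteString and report its size and how long that took.
timeRender :: BL.ByteString -> IO (Int, NominalDiffTime)
timeRender page = do
  start <- getCurrentTime
  -- length walks every chunk, so the whole document is generated here
  len   <- evaluate (BL.length page)
  end   <- getCurrentTime
  return (fromIntegral len, end `diffUTCTime` start)
```

The page passed in should not have been evaluated already, otherwise the measurement only covers walking chunks that already exist.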

Routing itself takes around 1ms, based on the dynamic approach I described in this post. On my laptop, which has faster cores but far fewer of them, crafting a response takes anywhere from 2ms to half a second, and routing takes around 0.2ms, for the same server configuration and package collection.

Unfortunately, sparky itself seems a bit laggy: yesterday it took 30 seconds (!) to request and retrieve a 350KB HTML document which is fully cached in memory, even though it took a fraction of a millisecond to get a ByteString for it. I’m looking into this.

Try it out!

So, take a look around and tell me what you think! If you want to try out your own copy, these should work as bash shell scripts, if you have ghc+cabal-install+alex+happy on your system: import current Hackage data or start a completely new server. (These install the server and use its command line interface.) Importing the current Hackage dataset requires somewhere in the neighborhood of 750MB of memory (I’m looking to reduce this) and 600MB to run the server (sparky has 32GB of memory). A brand new server requires just 2MB of memory.

To do

The primary goal this summer was to create a server architecture that could handle whatever we as a community need, and implement as much of it in Haskell as possible. I’m only one person, so there’s still a lot left to do, short-term and long-term, to get a better Hackage. I’ve outlined some of these tasks below.

What needs to be done before deploying to hackage.haskell.org?

Documentation. It’s one of the most important things Hackage provides. hackage-server lets maintainers upload documentation tarballs, but ticket 517 should be resolved so documentation can be more easily generated with Cabal.
Importing download statistics from the last few years. Granted, this is a minor one, but it’s a big help to have these without a gap in recording.
Stress-testing, in terms of making sure the server performs well and maintains the consistency of internal indices. Make sparky a bit more responsive. Ensure compatibility with cabal-install, including old versions. Double-check security in order to minimize the risk of attacks (replay, DDoS, etc.).
Deciding policy for things like account creation and uploading. I’ll put up a blog post soon about the policy that hackage-server currently has for these sorts of things, including an overview of the user group system.
Implementing backup for some of the newer features and creating an interface for admins to download backup tarballs.
Make sure the URI scheme is convenient for everyone.
Make robots.txt and set noindex on pages as appropriate.
Arrange for distribution maintainers (for Debian and Arch, presently) to send us updates about which packages they have available. Haskell packages in distribution repositories tend to be simpler to install and more stable, so connecting to them is important.
We need site admins and package trustees!

In the short-term future? (these should be implemented, sooner rather than later)

Build reports: get a system working for cabal-install clients to send build reports, anonymous or non-anonymous, as a replacement/enhancement of the build bot’s functionality. At present Hackage can accept basic build reports, but this should be gotten right before it’s enabled, particularly for anonymous reports.
Web interface redesign. Since Hackage has more information to serve, it needs a better way to visually organize it. Anyone with web design chops is welcome. Other things to do here: expose JSON representations for Ajax functionality; rewrite the HTML-generating code to use Blaze.
Serve the internals of packages and set up a sitemap.xml so they can go on Google Code Search.
Allow modifications to the cabal file without bumping the package version number. Admins can do this, but under some circumstances package maintainers might want to as well.
See if user group information can be stored better internally.
Get an SMTP client running on the server to send automated email notifications.
More server-side logging of actions (with user and timestamp): this makes it easier to find out what’s going on and provide historical data.

In the long term future? (looking into the crystal ball)

Social features. This includes reviews, voting, contributing content: the little things that let you know your fellow Haskellers are humans and not code-generating automatons (besides mailing lists, IRC, reddit, meetups, conferences, blog posts…). The more effectively we can connect maintainers and users, the better. Most of these social features would be simple to implement technically. It’s more difficult to decide which features would actually benefit us as a community and get better-quality packages.
Allow the creation of arbitrary groups of packages. Currently, there’s a Haskell Platform feature, which puts a little star next to every package that’s in the platform. Why not lay the groundwork for other package groups?
Insert your idea here

There’s a document in progress about the server internals, and how you can extend Hackage with new features. For the next week, I’ll be tidying up the code, bug-hunting, writing documentation, and seeing what I can do with transition preparations. Come join #hackage on freenode, if you like, since we’ll be discussing some of these things in the coming weeks.

August 8, 2010. Uncategorized. 10 comments.
