Scraping and downloading Haskell-related repositories from GitHub
topics: Haskell, archiving
created: 20 Mar 2011; modified: 13 Dec 2018; status: finished; confidence: highly likely; importance: 4

This Haskell tutorial was written in early March 2011, and while the code below worked then, it may not work with GitHub or the necessary Haskell libraries now. If you are interested in downloading from GitHub, I suggest looking into the GitHub API or the Haskell github library.

Why download?

Along the lines of Archiving URLs, I like to keep copies of Haskell-related source-code repositories[1] because the files & history might come in handy on occasion, and because having a large collection of repositories lets me search them for random purposes.

(For example, part of my lobbying for Control.Monad.void[2] was based on producing a list of dozens of source files which rewrote that particular idiom, and I have been able to usefully comment & judge based on crude statistics gathered by grepping[3] through my hundreds of repositories.)

Previously I wrote a simple script to download the repositories of the source repository hosting site Patch-tag, which specializes in hosting Darcs repositories (usually Haskell-related). GitHub is a much larger & more popular hosting site, and though it supports git rather than Darcs (as the name indicates), it is so popular that it still hosts a great deal of Haskell. I’ve downloaded a few repositories out of curiosity or because I was working on the contents of the repository (eg. gitit), but there are too many to download manually. I needed a script.

Patch-tag was nice enough to supply a URL which provided exactly the URLs I needed, but I couldn’t expect such personalized support from GitHub. GitHub does supply an API of sorts for developers and hobbyists, but said API provides no obvious way to get what I want: URLs for all Haskell-related repos. So, scraping it is - I’d write a script to munge some GitHub HTML and get the URLs I want that way.

Archiving GitHub

Parsing pages

The closest I can get to a target URL is We’ll be parsing that. The first thing to do is to steal TagSoup code from my previous scrapers, so our very crudest version looks like this:

Downloading pages (the lazy way)

We run it and it throws an exception!

*** Exception: getAddrInfo: does not exist (No address associated with hostname)

Oops. We got so wrapped up in parsing the HTML that we forgot to make sure that downloading worked in the first place. Well, we’re lazy programmers, so now, on demand, we’ll investigate that problem. The exception thrown sounds like a problem with the openURL call - hostname is a networking term, not a parsing or printing term. So we try running just openURL "" - same error. Not helpful.

We try a different implementation of openURL, mentioned in the other scraping script:

Calling that again, we see:

> openURL ""
Loading package HTTP-4000.1.1 ... linking ... done.
Right HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.67
Date: Tue, 15 Mar 2011 22:54:31 GMT
Content-Type: text/html
Content-Length: 185
Connection: close

Oh dear. It seems that the HTTP package just won’t handle HTTPS; its description does not mention HTTPS, nor do any of the module names seem connected to it. Best to give up entirely on it.

If we google Haskell https, one of the first 20 hits happens to be a Stack Overflow question/page which sounds promising: Haskell Network.Browser HTTPS Connection. The one answer says to simply use the Haskell binding to curl. Well, fine. I already had that installed because Darcs uses the binding for downloads. I’ll use that package.[4]

We go to the top level module hoping for an easy download. Scrolling down, one’s eye is caught by a curlGetString, which while not necessarily a promising name, does have an interesting type: URLString -> [CurlOption] -> IO (CurlCode, String).

Note especially the return value - from past experience with the HTTP package, one would give good odds that the URLString is just a type synonym for a URL string and the String return just the HTML source we want. What CurlOption might be, I have no idea, but let’s try simply omitting them all. So we load the module in GHCi (:module + Network.Curl) and see what curlGetString "" [] does:

Great! As they say, try the simplest possible thing that could possibly work, and this seems to. We don’t really care about the exit code, since this is a hacky script[5]; we’ll throw it away and only keep the second part of the tuple with the usual snd. It’s in IO so we need to use liftM or fmap before we can apply snd. Combined with our previous TagSoup code, we get:
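To see the fmap snd step in isolation, here is a minimal sketch that swaps the real download for a stub. The names fakeFetch and bodyOnly are hypothetical, and the stub's Int merely stands in for curl's status code; nothing here touches the network.

```haskell
-- Hypothetical stand-in for a download action: it returns a (status, body)
-- pair inside IO, the same shape as the result described in the text.
fakeFetch :: String -> IO (Int, String)
fakeFetch url = return (0, "<html>" ++ url ++ "</html>")

-- snd works on a plain tuple, not on an IO action, so it has to be lifted
-- over IO with fmap (liftM would do the same job).
bodyOnly :: String -> IO String
bodyOnly url = fmap snd (fakeFetch url)
```

Swapping fakeFetch for a real download function of the same shape should leave bodyOnly unchanged.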

Spidering (the lazy way)

What’s the output of this?

["logo boring","","/plans","/explore","/features","/blog",
    "selected","/languages/Haskell/updated", "/languages","/languages/ActionScript/created",
    " ","","",
    "#","minibutton btn-forward js-all-locales","nofollow","?locale=en","nofollow","?locale=af",

Quite a mouthful, but we can easily filter things down. /languages/Haskell/created?page=3 is an example of a link to the next page listing Haskell repositories; presumably the current page would be ?page=1, and the highest listed seems to be /languages/Haskell/created?page=209. The actual repositories look like /jaspervdj/data-object-yaml.

The regularity of the created numbering suggests that we can avoid any actual spidering. Instead, we could just figure out the number of the last (highest) page, and then generate all the page names in between, because they follow a simple scheme.

Assume we have the final number, n. We already know we can get the full list of page numbers through [1..n]; then we want to prepend languages/Haskell/created?page=, but it’s a type error to simply write map ("languages/Haskell/created?page="++) [1..n]. There is only one type-variable in (++) :: [a] -> [a] -> [a], so both arguments must be lists of the same type, and a String cannot be appended to an Integer. To convert the Integers to proper Strings, we apply show to each one, so that gives us our generator:

(This will throw a warning using -Wall because GHC has to guess whether the 1 is an Int or Integer. This can be quieted by writing (1::Int) instead.)
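A minimal sketch of such a generator, under the assumption that the prefix is passed in as an argument rather than hardwired; pageURLs and firstPages are hypothetical names used only for illustration.

```haskell
-- Hypothetical sketch: build the (infinite, lazy) list of index-page
-- addresses by appending each page number to a fixed prefix.
-- The (1 :: Int) annotation avoids the defaulting warning under -Wall.
pageURLs :: String -> [String]
pageURLs prefix = map (\n -> prefix ++ show n) [(1 :: Int) ..]

-- Laziness means only as many entries as are demanded get computed.
firstPages :: Int -> String -> [String]
firstPages k prefix = take k (pageURLs prefix)
```

For instance, firstPages 3 "languages/Haskell/created?page=" yields the addresses for pages 1 through 3.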

But what is n? We don’t know the final, highest, oldest page. We don’t know how much of our infinite lazy list to take. It’s easy enough to filter the list to get only the index: filter (isPrefixOf "/languages/Haskell/created?page=").

Then we call last, right? (Or something like head . reverse, if we didn’t know last or didn’t think to check the Hoogle hits for [a] -> a.) But if you look back at the original scraping output, you see an example of how a simple approach can go wrong; we read /languages/Haskell/created?page=209 and then we read /languages/Haskell/created?page=2! 2 is less than 209, of course, and is the wrong answer. GitHub is not padding the numbers to look like created?page=002, so our simple-minded approach doesn’t work.

So we need to extract the number. Easy enough: the prefix is statically known and never changes, so we can hardwire some crude parsing using drop: drop 32. How to turn the remaining String into an Int? Hopefully one knows about read, but even here Hoogle will save our bacon if we think to look through the list of hits for String -> Int - read turns up as hit #10 or #11. Then, now that we have turned our [String] into [Int], we could sort it and take the last entry, or again go to the standard library and use maximum (like read, it will turn up for [Int] -> Int, if not as highly ranked as one might hope). Tweaking the syntax a little, our final result is:
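The difference between the two approaches can be sketched as follows. The helper names indexPrefix, indexLinks, naiveLast and highestPage are hypothetical; note that last, maximum and read are all partial, so this sketch assumes at least one well-formed index link is present.

```haskell
import Data.List (isPrefixOf)

-- The statically known prefix; it is 32 characters long.
indexPrefix :: String
indexPrefix = "/languages/Haskell/created?page="

-- Keep only the links that point at index pages.
indexLinks :: [String] -> [String]
indexLinks = filter (indexPrefix `isPrefixOf`)

-- Naive approach: trust the order of links on the page.
-- This goes wrong whenever a lower-numbered page is linked last.
naiveLast :: [String] -> String
naiveLast = last . indexLinks

-- Drop the 32-character prefix, parse what remains as an Int,
-- and take the largest page number seen.
highestPage :: [String] -> Int
highestPage = maximum . map (read . drop 32) . indexLinks
```

Given links ending in ?page=209 followed by ?page=2, naiveLast picks the ?page=2 link while highestPage returns 209.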

If we didn’t want to hardwire this for Haskell, we’d probably write the function with an additional parameter for the language and replace the hardwired 32 with a runtime calculation of how many characters to remove:
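One way this could look, as a sketch: lastPageFor is a hypothetical name, and the address scheme for other languages is assumed to mirror the Haskell one.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical generalisation: the language name is a parameter, and the
-- number of characters to drop is computed from the prefix at runtime
-- instead of being hardwired as 32.
lastPageFor :: String -> [String] -> Int
lastPageFor lang =
    maximum . map (read . drop (length prefix)) . filter (prefix `isPrefixOf`)
  where
    prefix = "/languages/" ++ lang ++ "/created?page="
```

Like the hardwired version, this is partial: it fails on a page with no matching index links.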

So let’s put what we have together. The program can download an initial index page, parse it, find the name of the last index page, and generate the URLs of all index pages, and print those out (to prove that it all works):

So where were we? We had a [String] (in a variable named indxPgs) which represents all the index pages. We can get the HTML source of each page just by reusing openURL (it works on the first one, so it stands to reason it’d work on all index pages), which is trivial by this point: mapM openURL indxPgs.

Filtering repositories

In the TagSoup result, we saw the addresses of the repositories listed on the first index page:

Without looking at the rendered page in our browser, it’s obvious that GitHub is linking first to whatever user owns or created the repository, and then linking to the repository itself. We don’t want the users, but the repositories. Fortunately, an equally obvious pattern holds: no user address has two forward-slashes in it, while every repository address does.

So we want to count the forward-slashes and keep every address with exactly 2 forward-slashes. The type for our function takes a list, a possible entry in that list, and returns a count. This is easy to do with primitive recursion and an accumulator, or perhaps length combined with filter; but the base library already has functions for a -> [a] -> Int. elemIndex annoyingly returns a Maybe Int, so we’ll use elemIndices instead and call length on its output: length (elemIndices '/' x) == 2.
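The slash-counting test can be sketched on its own; countSlashes and looksLikeRepo are hypothetical names.

```haskell
import Data.List (elemIndices)

-- elemIndices returns every position at which '/' occurs,
-- so the length of that list is the number of slashes.
countSlashes :: String -> Int
countSlashes = length . elemIndices '/'

-- A repository address has the form /username/reponame: exactly two slashes.
looksLikeRepo :: String -> Bool
looksLikeRepo x = countSlashes x == 2
```

This accepts /jaspervdj/data-object-yaml and rejects /jaspervdj, but it also accepts any other two-slash address such as /languages/Haskell.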

This is not quite right. If we run this on the original parsed output, we get

It doesn’t look like we mistakenly omitted a repository, but it does look like we mistakenly included things we should not have. We need to filter out anything beginning with http://, https://, /site/, /languages/, or /login/.[6]

We could call filter multiple times, or use a tricky foldr to accumulate only results which don’t match any of the items in our list ["/languages/", "/login/", "/site/", "http://", "https://"]. But I already wrote the solution to this problem back in the original WP RSS archive-bot, where I noticed that my original giant filter call could be replaced by a much more elegant use of any:

In our case, we replace isInfixOf with isPrefixOf, and we have different constants defined in exceptions. To put it all together into a new filtering function, we have:
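The blacklist half of that filter, taken by itself, might be sketched like this; isException and dropExceptions are hypothetical names, while the exceptions list is the one given in the text.

```haskell
import Data.List (isPrefixOf)

-- Prefixes that mark an address as something other than a repository.
exceptions :: [String]
exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]

-- True if the address starts with any blacklisted prefix.
isException :: String -> Bool
isException x = any (`isPrefixOf` x) exceptions

-- One pass over the list replaces a chain of separate filter calls.
dropExceptions :: [String] -> [String]
dropExceptions = filter (not . isException)
```

Adding another excluded prefix then only means extending the exceptions list.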

Our new minimalist program, which will test out repos:

The output:

Shelling out to git

That leaves the ‘shell out to git’ functionality. We could try stealing the spawn (call out to /bin/sh) code from XMonad, but the point of spawn is that it forks away completely from our script, which will completely screw up our desired lack of parallelism.[7] I ultimately wound up using a function from System.Process, readProcessWithExitCode. (Why readProcessWithExitCode and not readProcess? Because if a directory already exists, git exits with an error, and readProcess turns any non-zero exit code into an exception which kills the script!) This will work:
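As a sketch of why readProcessWithExitCode suits this job: it hands the exit code back as an ordinary value, so a failing command does not abort the program. The wrapper name runQuietly is hypothetical, and the example in the usage note assumes a POSIX sh is on the PATH.

```haskell
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Run a command with the given arguments and empty stdin.
-- A non-zero exit code comes back as a value rather than an exception,
-- so the caller decides whether to care about it.
runQuietly :: FilePath -> [String] -> IO Bool
runQuietly cmd args = do
  (code, _out, _err) <- readProcessWithExitCode cmd args ""
  return (code == ExitSuccess)
```

For example, runQuietly "sh" ["-c", "exit 3"] returns False instead of throwing. A command that cannot be started at all (say, a missing executable) still raises an exception.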

In retrospect, it might have been a better idea to try to use runCommand or System.Cmd. Alternatively, we could use the same shelling out functionality from the original script:

Which could be rewritten for us (sans logging) as

Now it’s easy to fill in our 2 missing lines:

(The concat is there because getRepos gave us a [String] for each String, and then we ran it on a [String] - so our result is [[String]]! But we don’t care about preserving the information about where each String came from, so we smush it down to a single list. Strictly speaking, we didn’t need to do print y in shellToGit, but while developing, it’s a good idea to have some sort of logging - get a sense of what the script is doing. And once you are printing at all, you can sort the list of repository URLs to download them in order by user.)
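The flatten-then-sort step is small enough to sketch separately; flattenAndOrder is a hypothetical name.

```haskell
import Data.List (sort)

-- Each index page yields its own list of repository addresses; concat
-- merges them into one list, and sort then groups them by username,
-- since every address begins with /username/.
flattenAndOrder :: [[String]] -> [String]
flattenAndOrder = sort . concat
```

Because the sort is plain lexicographic String ordering, repositories by the same user end up adjacent.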

Unique repositories

There is one subtlety here worth noting that our script is running rough-shod over. Each URL we download is unique, because usernames are unique on GitHub and each URL is formed from a /username/reponame pair. But each downloaded repository is not unique, because git will shuck off the username and create a directory with just the repository name - /john/bar and /jack/bar will clash, and if you download in that order, the bar repository will be John’s repository and not Jack’s repository. Git will error out the second time, but this error is ignored by the shelling code. The solution would be to tell git to clone to a non-default but unique directory (for example, one could reuse the /username/reponame and then one’s target directory would be neatly populated by several hundred directories named after users, each populated by a few repositories with non-unique names). If we went with the per-user approach, our new version would look like this:

Why the drop 19 u? Well, u is the fully qualified URL, eg. Obviously we don’t want to execute git clone "" "" (even though that’d be perfectly valid), because it makes for ugly folders. But drop 19 "" turns into sergeyastanin/simpleea, giving us the right local directory name with no prefixed slash.

Or you just pass in the original /username/reponame and use drop 1 on that instead. (Either way, you need to do additional work. Might as well just use drop 19.)

One final note: many of the URLs end in .git. If we disliked this, then we could enhance the drop 19 with System.FilePath.dropExtension: dropExtension $ drop 19 u.
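A sketch of the per-user directory calculation, with the prefix length computed instead of hardwired. Everything named here is hypothetical: examplePrefix is a placeholder host rather than the real one, and localDir and cloneArgs are illustrative helpers.

```haskell
import System.FilePath (dropExtension)

-- Placeholder host prefix, for illustration only.
examplePrefix :: String
examplePrefix = "https://example.org/"

-- Strip the host prefix and the trailing extension, leaving
-- username/reponame as the local target directory.
localDir :: String -> String -> FilePath
localDir prefix url = dropExtension (drop (length prefix) url)

-- Arguments for git: clone the URL into the per-user directory,
-- so that john/bar and jack/bar no longer collide.
cloneArgs :: String -> String -> [String]
cloneArgs prefix url = ["clone", url, localDir prefix url]
```

One caveat: dropExtension removes whatever the final extension is, not just .git, so this assumes every URL passed in really ends in .git.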

The script

The final program, clean of -Wall or hlint warnings:

import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString, URLString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL ""
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ sort $ concat repourls
          mapM_ shellToGit gitURLs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "" ++ show x) [(1::Int)..]

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

getRepos :: String -> IO [String]
getRepos = fmap repos . openURL

gitify :: String -> String
gitify x = "" ++ x ++ ".git"

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, dropExtension $ drop 19 u] ""
                  print y

This, or a version of it, works well. But I caution people against misusing it! There are a lot of repositories on GitHub; please don’t go running this carelessly. It will pull down 4-12 gigabytes of data. GitHub is a good FLOSS-friendly business by all accounts, and doesn’t deserve people wasting its bandwidth & money if they are not even going to keep what they downloaded.

The script golfed

For kicks, let’s see what a shorter, more unmaintainable and unreadable, version looks like (in the best scripting language tradition):

14 lines of code isn’t too bad, especially considering that Haskell is not usually considered a language suited for scripting or scraping purposes like this. Nor do I see any obvious missing abstractions - count is a function that might be useful in Data.List, and openURL is something that the Curl binding could provide on its own, but everything else looks pretty necessary.

Exercises for the reader

  1. Once one has all those repositories, how does one keep them up to date? The relevant command is git pull. How would one run this on all the repositories? In shell script? Using find? From a crontab?[8]
  2. In the previous script, a number of short-cuts were taken which render it Haskell-specific. Identify and remedy them, turning this script into a general-purpose script for downloading any language for which GitHub has a category. (The language name can be accessed by reading an argument to the script by standard functions like getArgs.)

Footnotes

  1. This page assumes a basic understanding of how version control programs work, Haskell syntax, and the Haskell standard library. For those not au courant, a repository is basically a collection of logically related files plus a detailed history of the modifications that built them up from nothing.

  2. From my initial email:

    I’d think it [Control.Monad.void] [would] be useful for more than just me. Agda is lousy with calls to >> return (); and then there’s ZMachine, arrayref, whim, the barracuda packages, binary, bnfc, buddha, bytestring, c2hs, cabal, chesslibrary, comas, conjure, curl, darcs, darcs-benchmark, dbus-haskell, ddc, dephd, derive, dhs, drift, easyvision, ehc, filestore, folkung, geni, geordi, gtk2hs, gnuplot, ginsu, halfs, happstack, haskeline, hback, hbeat… You get the picture.

  3. I occasionally also use Haskell scripts based on haskell-src-exts.

  4. Besides curl, there is the http-wget wrapper, and the http-enumerator package claims to natively support HTTPS. I have not tried them.

  5. What would we do? Keep retrying? There are going to be tons of errors in this script anyway, from repositories incorrectly identified as Haskell-related to repository duplicates to transient network errors, so retrying would add a great deal of complexity and might even make the script less reliable.

  6. Oh no - a blacklist! This should make us unhappy, because as computer security has taught us, blacklists fall out of date quickly or were never correct to begin with. Much better to whitelist, but how can we do that? People could name their repositories any damn thing, and pick any accursed username; for all our code knows, /site/terms is a repository - perhaps the user site maintains some sort of natural language library called terms.

  7. A quick point: in the previous scripts, I went to some effort to get greater parallelism, but in this case, we don’t want to hammer GitHub with a few thousand simultaneous git clone invocations; Haskell repositories are created rarely enough that we can afford to be friendly and only download one repository at a time.

  8. Example cron answer: @weekly find ~/bin -type d -name ".git" -execdir nice git pull \;