Archiving GitHub

Scraping and downloading Haskell-related repositories from GitHub
Haskell, archiving, tutorial
2011-03-20–2013-10-28 finished certainty: highly likely importance: 4


This Haskell tutorial was written in early March 2011, and while the code below worked then, it may not work with GitHub or the necessary Haskell libraries now. If you are interested in downloading from GitHub, I suggest looking into the GitHub API or the Haskell github library.

Why download?

Along the lines of , I like to keep copies of Haskell-related source code1 because the files & history might come in handy on occasion, and because having a large collection of repositories lets me search them for random purposes.

(For example, part of my lobbying for Control.Monad.void2 was based on producing a list of dozens of source files which rewrote that particular idiom, and I have been able to usefully comment & based on crude statistics gathered by 3 through my hundreds of repositories.)

Previously I wrote a simple script to download the repositories of the source repository hosting site Patch-tag.com. Patch-tag specializes in hosting Darcs repositories (usually Haskell-related). GitHub is a much larger & more popular hosting site, and though it supports not Darcs but (as the name indicates) Git, it is so popular that it still hosts a great deal of Haskell. I’ve downloaded a few repositories out of curiosity or because I was working on the contents of the repository (eg. gitit), but there are too many to download manually. I needed a script.

Patch-tag was nice enough to supply a URL which provided exactly the URLs I needed, but I couldn’t expect such personalized support from GitHub. GitHub does supply an API of sorts for developers and hobbyists, but said API provides no obvious way to get what I want: ‘URLs for all Haskell-related repos’. So, scraping it is—I’d write a script to munge some GitHub HTML and get the URLs I want that way.

Archiving GitHub

Parsing pages

The closest I can get to a target URL is https://github.com/languages/Haskell/created. We’ll be parsing that. The first thing to do is to steal TagSoup code from my previous scrapers, so our very crudest version looks like this:

import Text.HTML.TagSoup
import Text.HTML.Download (openURL)

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links
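-- grab every attribute value of every <a> tag (href values, but also class names etc.)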
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

Downloading pages (the lazy way)

We run it and it throws an exception! *** Exception: getAddrInfo: does not exist (No address associated with hostname)

Oops. We got so wrapped up in parsing the HTML that we forgot to make sure downloading worked in the first place. Well, we’re lazy programmers, so now, on demand, we’ll investigate that problem. The exception thrown sounds like a problem with the openURL call—‘getAddrInfo’ is a networking term, not a parsing or printing term. So we try running just openURL "https://github.com/languages/Haskell/created"—same error. Not helpful.

We try a different implementation of openURL, mentioned in the other scraping script:

import Network.HTTP (getRequest, simpleHTTP)
openURL = simpleHTTP . getRequest

Calling that again, we see:

> openURL "https://github.com/languages/Haskell/created"
Loading package HTTP-4000.1.1 ... linking ... done.
Right HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.67
Date: Tue, 15 Mar 2011 22:54:31 GMT
Content-Type: text/html
Content-Length: 185
Connection: close
Location: https://github.com/languages/Haskell/created

Oh dear. It seems that the HTTP package just won’t handle HTTPS; neither does its description mention HTTPS, nor do any of the module names seem connected. Best to give up on it entirely.

If we google ‘Haskell https’, one of the first 20 hits happens to be a question/page which sounds promising: “Haskell Network.Browser HTTPS Connection”. The one answer says to simply use the Haskell binding to libcurl. Well, fine. I already had that installed because Darcs uses the binding for downloads. I’ll use that package.4

We go to the top-level module, Network.Curl, hoping for an easy download. Scrolling down, one’s eye is caught by a curlGetString, which, while not necessarily a promising name, does have an interesting type: URLString -> [CurlOption] -> IO (CurlCode, String).

Note especially the return value—from past experience with the HTTP package, one would give a good chance that the URLString is just a type synonym for a URL string and the String return just the HTML source we want. What CurlOption might be, I have no idea, but let’s try simply omitting them all. So we load the module in GHCi (:module + Network.Curl) and see what curlGetString "https://github.com/languages/Haskell/created" [] does:

(CurlOK,"<!DOCTYPE html>
<html>
  <head>
    <meta charset='utf-8'>
    <meta http-equiv=\"X-UA-Compatible\" content=\"chrome=1\">
        <title>Recently Created Haskell Repositories - GitHub</title>
    <link rel=\"search\" type=\"application/opensearchdescription+xml\"
                          href=\"/opensearch.xml\" title=\"GitHub\" />
    <link rel=\"fluid-icon\" href=\"https://github.com/fluidicon.png\" title=\"GitHub\" />
  ...")

Great! As they say, ‘try the simplest possible thing that could possibly work’, and this seems to. We don’t really care about the exit code, since this is a hacky script5; we’ll throw it away and only keep the second part of the tuple with the usual snd. It’s in IO, so we need to use liftM or fmap before we can apply snd. Combined with our previous TagSoup code, we get:

import Text.HTML.TagSoup
import Network.Curl (curlGetString, URLString)

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

Spidering (the lazy way)

What’s the output of this?

["logo boring","https://github.com","/plans","/explore","/features","/blog",
    "/login?return_to=https://github.com/languages/Haskell/created",
    "/languages/Haskell","/explore","explore_main","/repositories","explore_repos","/languages",
    "selected","explore_languages","/timeline","explore_timeline","/search","code_search",
    "/tips","explore_tips","/languages/Haskell","/languages/Haskell/created",
    "selected","/languages/Haskell/updated", "/languages","/languages/ActionScript/created",
    "/languages/Ada/created","/languages/Arc/created","/languages/ASP/created",
    "/languages/Assembly/created",
    "/languages/Boo/created","/languages/C/created","/languages/C%23/created",
    "/languages/C++/created","/languages/Clojure/created",
    "/languages/CoffeeScript/created",
    "/languages/ColdFusion/created","/languages/Common%20Lisp/created","/languages/D/created",
    "/languages/Delphi/created","/languages/Duby/created",
    "/languages/Eiffel/created",
    "/languages/Emacs%20Lisp/created","/languages/Erlang/created","/languages/F%23/created",
    "/languages/Factor/created","/languages/FORTRAN/created",
    "/languages/Go/created",
    "/languages/Groovy/created","/languages/HaXe/created","/languages/Io/created",
    "/languages/Java/created","/languages/JavaScript/created",
    "/languages/Lua/created",
    "/languages/Max/MSP/created","/languages/Nu/created","/languages/Objective-C/created",
    "/languages/Objective-J/created","/languages/OCaml/created",
    "/languages/ooc/created",
    "/languages/Perl/created","/languages/PHP/created","/languages/Pure%20Data/created",
    "/languages/Python/created","/languages/R/created",
    "/languages/Racket/created",
    "/languages/Ruby/created","/languages/Scala/created","/languages/Scheme/created",
    "/languages/sclang/created","/languages/Self/created",
    "/languages/Shell/created",
    "/languages/Smalltalk/created","/languages/SuperCollider/created","/languages/Tcl/created",
    "/languages/Vala/created","/languages/Verilog/created",
    "/languages/VHDL/created",
    "/languages/VimL/created","/languages/Visual%20Basic/created","/languages/XQuery/created",
    "/brownnrl","/brownnrl/Real-World-Haskell","/joh",
    "/joh/tribot","/bjornbm","/bjornbm/publicstuff","/codemac","/codemac/yi","/poconnell93",
    "/poconnell93/chat","/jillianfu","/jillianfu/Angel","/jaspervdj","/jaspervdj/sup-host","/serras",
    "/serras/scion-ghc-7-requisites","/serras","/serras/scion","/iand675","/iand675/cgen","/shangaslammi",
    "/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace","/jaspervdj","/jaspervdj/wol","/tomlokhorst",
    "/tomlokhorst/wol","/bos","/bos/concurrent-barrier","/jkingry","/jkingry/projectEuler","/olshanskydr",
    "/olshanskydr/xml-enumerator","/lorenz","/lorenz/fypmaincode","/jaspervdj",
    "/jaspervdj/data-object-json","/jaspervdj","/jaspervdj/data-object-yaml",
    "/languages/Haskell/created?page=2","next","/languages/Haskell/created?page=3",
    "/languages/Haskell/created?page=4","/languages/Haskell/created?page=5",
    "/languages/Haskell/created?page=6","/languages/Haskell/created?page=7",
    "/languages/Haskell/created?page=8","/languages/Haskell/created?page=9",
    "/languages/Haskell/created?page=208","/languages/Haskell/created?page=209",
    "/languages/Haskell/created?page=2","l","next","http://www.rackspace.com","logo",
    "http://www.rackspace.com ","http://www.rackspacecloud.com","https://github.com/blog",
    "/login/multipass?to=http%3A%2F%2Fsupport.github.com","https://github.com/training",
    "http://jobs.github.com","http://shop.github.com",
    "https://github.com/contact","http://develop.github.com","http://status.github.com",
    "/site/terms","/site/privacy","https://github.com/security",
    "nofollow","?locale=de","nofollow","?locale=fr","nofollow","?locale=ja",
    "nofollow","?locale=pt-BR","nofollow","?locale=ru","nofollow","?locale=zh",
    "#","minibutton btn-forward js-all-locales","nofollow","?locale=en","nofollow","?locale=af",
    "nofollow","?locale=ca","nofollow","?locale=cs","nofollow","?locale=de",
    "nofollow","?locale=es","nofollow","?locale=fr","nofollow","?locale=hr",
    "nofollow","?locale=hu","nofollow","?locale=id","nofollow","?locale=it",
    "nofollow","?locale=ja","nofollow","?locale=nl","nofollow","?locale=no",
    "nofollow","?locale=pl","nofollow","?locale=pt-BR","nofollow","?locale=ru",
    "nofollow","?locale=sr","nofollow","?locale=sv","nofollow","?locale=zh",
    "#","js-see-all-keyboard-shortcuts"]

Quite a mouthful, but we can easily filter things down. “/languages/Haskell/created?page=3” is an example of a link to the next page listing Haskell repositories; presumably the current page would be “?page=1”, and the highest listed seems to be “/languages/Haskell/created?page=209”. The actual repositories look like “/jaspervdj/data-object-yaml”.

The regularity of the “created” numbering suggests that we can avoid any actual spidering. Instead, we could just figure out the last (highest) page and then generate all the page names in between, since they follow a simple scheme.

Assume we have the final number, n; we already know we can get the full list of page numbers through [1..n]. Then we want to prepend “languages/Haskell/created?page=”, but it’s a type error to simply write map ("languages/Haskell/created?page="++) [1..n]: there is only one type-variable in (++) :: [a] -> [a] -> [a], so both arguments must be lists of the same type—Strings, not Integers. To convert the Integers to proper Strings, we do map show, and that gives us our generator:

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [1..]

(This will throw a warning under -Wall because GHC has to guess whether the 1 is an Int or Integer. This can be quieted by writing (1::Int) instead.)
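As a quick illustration of why the map show is needed, here is roughly what a GHCi session looks like (the exact error message varies by GHC version):

> map ("languages/Haskell/created?page=" ++) [1..3]
<interactive>: No instance for (Num [Char]) arising from the literal ‘1’

> map (("languages/Haskell/created?page=" ++) . show) [1..3]
["languages/Haskell/created?page=1","languages/Haskell/created?page=2",
 "languages/Haskell/created?page=3"]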

But what is n? We don’t know the final, highest, oldest page, so we don’t know how much of our infinite lazy list to take. It’s easy enough to filter the list to get only the index pages: filter (isPrefixOf "/languages/Haskell/created?page=").

Then we call last, right? (Or something like head . reverse if we didn’t know last, or if we didn’t think to check the Hoogle hits for [a] -> a.) But if you look back at the original scraping output, you see an example of how a simple approach can go wrong: we read “/languages/Haskell/created?page=209” and then we read “/languages/Haskell/created?page=2”! 2 is less than 209, of course, and is the wrong answer. GitHub is not padding the numbers to look like “created?page=002”, so our simple-minded approach doesn’t work.
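To see the trap concretely, take just the two page-links quoted above:

> last ["/languages/Haskell/created?page=209","/languages/Haskell/created?page=2"]
"/languages/Haskell/created?page=2"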

So we need to extract the number. Easy enough: the prefix is statically known and never changes, so we can hardwire some crude parsing using drop: drop 32. How to turn the remaining String into an Int? Hopefully one knows about read, but even here Hoogle will save our bacon if we think to look through the list of hits for String -> Int: read turns up as hit #10 or #11. Now that we have turned our [String] into [Int], we could sort it and take the last entry, or again go to the standard library and use maximum (like read, it will turn up for [Int] -> Int, if not as highly ranked as one might hope). Tweaking the syntax a little, our final result is:

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

If we didn’t want to hardwire this for Haskell, we’d probably write the function with an additional parameter and replace the Int with a runtime calculation of what to remove:

lastPageGeneric :: String -> [String] -> Int
lastPageGeneric lang = maximum . map (read . drop (length lang)) . filter (lang `isPrefixOf`)
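To recover the hardwired behavior, we would pass in the full prefix; a hypothetical usage (the name lastPageHaskell is ours, for illustration), equivalent to lastPage above:

lastPageHaskell :: [String] -> Int
lastPageHaskell = lastPageGeneric "/languages/Haskell/created?page="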

So let’s put together what we have. The program can download an initial index page, parse it, find the name of the last index page, generate the URLs of all index pages, and print those out (to prove that it all works):

import Data.List (isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          print indxPgs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1::Int)..]

So where were we? We had a [String] (in a variable named indxPgs) which represents all the index pages. We can get the HTML source of each page just by reusing openURL (it works on the first one, so it stands to reason it’d work on all index pages), which is trivial by this point: mapM openURL indxPgs.

Filtering repositories

In the TagSoup result, we saw the addresses of the repositories listed on the first index page:

["/brownnrl","/brownnrl/Real-World-Haskell","/joh","/joh/tribot","/bjornbm","/bjornbm/publicstuff",
    "/codemac","/codemac/yi","/poconnell93","/poconnell93/chat","/jillianfu","/jillianfu/Angel",
    "/jaspervdj","/jaspervdj/sup-host","/serras","/serras/scion-ghc-7-requisites","/serras","/serras/scion",
    "/iand675","/iand675/cgen","/shangaslammi","/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace",
    "/jaspervdj","/jaspervdj/wol","/tomlokhorst","/tomlokhorst/wol","/bos","/bos/concurrent-barrier",
    "/jkingry","/jkingry/projectEuler","/olshanskydr","/olshanskydr/xml-enumerator","/lorenz",
    "/lorenz/fypmaincode","/jaspervdj","/jaspervdj/data-object-json","/jaspervdj",
    "/jaspervdj/data-object-yaml"]

Without looking at the rendered page in our browser, it’s obvious that GitHub is linking first to whatever user owns or created the repository, and then linking to the repository itself. We don’t want the users, but the repositories. Fortunately, it’s equally obvious how to tell them apart: no user page has two forward-slashes in it, while every repository page does.

So we want to count the forward-slashes and keep every address with exactly 2 forward-slashes. The type for our function takes a possible entry and a list, and returns a count. This is easy to do with primitive recursion and an accumulator, or perhaps length combined with filter; but the base library already has functions for a -> [a] -> Int. elemIndex annoyingly returns a Maybe Int, so we’ll use elemIndices instead and call length on its output: length (elemIndices '/' x) == 2.
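For comparison, the two alternatives just mentioned (primitive recursion with an accumulator, and length combined with filter) might look like this; the names countRec & countFilter are ours:

-- counting with explicit recursion and an accumulator
countRec :: Char -> String -> Int
countRec c = go 0
  where go n []     = n
        go n (x:xs) = go (if x == c then n + 1 else n) xs

-- counting with length . filter
countFilter :: Char -> String -> Int
countFilter c = length . filter (== c)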

This slash-counting filter is not quite right on its own, though. If we run it on the original parsed output, we get:

["https://github.com","/languages/Haskell","/languages/Haskell","/plategreaves/unordered-containers",
    "/vincenthz/hs-tls-extra","/aculich/fix-symbols-gitit",
    "/sphynx/euler-hs","/DRMacIver/unordered-containers","/hamishmack/yesod-slides","/GNUManiacs/hoppla",
    "/DRMacIver/hs-rank-aggregation","/naota/hackage-autoebuild",
    "/magthe/hsini","/dagit/gnuplot-test","/imbaczek/HBPoker","/sergeyastanin/simpleea","/cbaatz/hamu8080",
    "/aristidb/xml-enumerator",
    "/elliottt/value-supply","/gnumaniacs-org/hoppla","/emillon/tyson","/quelgar/hifive",
    "/quelgar/haskell-websockets","http://www.rackspace.com",
    "http://www.rackspace.com ","http://www.rackspacecloud.com",
    "/login/multipass?to=http%3A%2F%2Fsupport.github.com","http://jobs.github.com",
    "http://shop.github.com","http://develop.github.com",
    "http://status.github.com","/site/terms","/site/privacy"]

It doesn’t look like we mistakenly omitted a repository, but it does look like we mistakenly included things we should not have. We need to filter out anything beginning with “http://”, “https://”, “/site/”, “/languages/”, or “/login/”.6

We could call filter multiple times, or use a tricky foldr to accumulate only results which don’t match any of the items in our list ["/languages/", "/login/", "/site/", "http://", "https://"]. But I already wrote a solution to this problem back in the original WP RSS archive-bot, where I noticed that my original giant filter call could be replaced by a much more elegant use of any:

 where  uniq :: [String] -> [String]
        uniq = filter (\x -> not $ any (flip isInfixOf x) exceptions)

        exceptions :: [String]
        exceptions = ["wikimediafoundation", "http://www.mediawiki.org/", "wikipedia",
                      "&curid=", "index.php?title=", "&action="]

In our case, we replace isInfixOf with isPrefixOf, and we have different constants defined in exceptions. To put it all together into a new filtering function, we have:

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

Our new minimalist program, which will test out repos:

import Data.List (elemIndices, isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          print $ repos html

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

The output:

["/plategreaves/unordered-containers","/vincenthz/hs-tls-extra","/aculich/fix-symbols-gitit",
    "/sphynx/euler-hs","/DRMacIver/unordered-containers","/hamishmack/yesod-slides",
    "/GNUManiacs/hoppla","/DRMacIver/hs-rank-aggregation","/naota/hackage-autoebuild","/magthe/hsini",
    "/dagit/gnuplot-test","/imbaczek/HBPoker",
    "/sergeyastanin/simpleea","/cbaatz/hamu808a0","/aristidb/xml-enumerator","/elliottt/value-supply",
    "/gnumaniacs-org/hoppla","/emillon/tyson",
    "/quelgar/hifive","/quelgar/haskell-websockets"]

Shelling out to git

That leaves the ‘shell out to git’ functionality. We could try stealing the spawn (call out to /bin/sh) code from XMonad, but the point of spawn is that it forks away completely from our script, which will completely screw up our desired lack of parallelism.7 I ultimately wound up using a function from System.Process, readProcessWithExitCode. (Why readProcessWithExitCode and not readProcess? Because if a directory already exists, git/readProcess throws an exception which kills the script!) This will work:

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u] ""
                  print y

In retrospect, it might have been a better idea to use runCommand or System.Cmd. Alternatively, we could use the same shelling-out functionality from the original patch-tag.com script:

mapM_ (\x -> runProcess "darcs" ["get", "--lazy", "http://patch-tag.com"++x]
                          Nothing Nothing Nothing Nothing Nothing) targets

Which could be rewritten for us (sans logging) as:

shellToGit :: String -> IO ()
shellToGit u = runProcess "git" ["clone", u] Nothing Nothing Nothing Nothing Nothing >> return ()
-- We could replace `return ()` with Control.Monad.void to drop `IO ProcessHandle` result
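A sketch of the runCommand route mentioned above; note that runCommand returns immediately with a ProcessHandle, so we must waitForProcess ourselves or we would reintroduce exactly the parallelism we are trying to avoid:

import System.Process (runCommand, waitForProcess)

shellToGit :: String -> IO ()
shellToGit u = do handle <- runCommand ("git clone " ++ u) -- spawned via the shell
                  _ <- waitForProcess handle               -- block until the clone finishes
                  return ()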

Now it’s easy to fill in our 2 missing lines:

          ...
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ concat repourls
          mapM_ shellToGit gitURLs

(The concat is there because getRepos gave us a [String] for each String, and then we ran it on a [String]—so our result is [[String]]! But we don’t care about preserving the information about where each String came from, so we smush it down to a single list. Strictly speaking, we didn’t need to do print y in shellToGit, but while developing, it’s a good idea to have some sort of logging—to get a sense of what the script is doing. And once you are printing at all, you can sort the list of repository URLs to download them in order by user.)
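For reference, the getRepos used above simply composes downloading with the repos filter; it appears again in the final script:

getRepos :: String -> IO [String]
getRepos = fmap repos . openURL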

Unique repositories

There is one subtlety worth noting here that our script is running roughshod over. Each URL we download is unique, because usernames are unique on GitHub and each URL is formed from a “/username/reponame” pair. But each downloaded repository is not unique, because git will shuck off the username and create a directory with just the repository name—“/john/bar” and “/jack/bar” will clash, and if you download in that order, the bar repository will be John’s repository and not Jack’s. Git will error out the second time, but this error is ignored by the shelling code. The solution would be to tell git to clone to a non-default but unique directory (for example, one could reuse the “/username/reponame” pair, and then one’s target directory would be neatly populated by several hundred directories named after users, each populated by a few repositories with non-unique names). If we went with the per-user approach, our new version would look like this:

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, drop 19 u] ""
                  print y

Why the drop 19 u? Well, u is the fully qualified URL, eg. “https://github.com/sergeyastanin/simpleea”. Obviously we don’t want to execute git clone "https://github.com/sergeyastanin/simpleea" "https://github.com/sergeyastanin/simpleea" (even though that’d be perfectly valid), because it makes for ugly folders. But drop 19 "https://github.com/sergeyastanin/simpleea" turns into “sergeyastanin/simpleea” (the prefix "https://github.com/" is exactly 19 characters), giving us the right local directory name with no prefixed slash.

Or you could just pass in the original “/username/reponame” and use drop 1 on that instead. (Either way, you need to do additional work. Might as well just use drop 19.)

One final note: many of the URLs end in “.git”. If we dislike this, then we can enhance the drop 19 with System.FilePath.dropExtension: dropExtension $ drop 19 u.
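A quick sanity check of the combined transformation in GHCi:

> dropExtension (drop 19 "https://github.com/sergeyastanin/simpleea.git")
"sergeyastanin/simpleea"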

The script

The final program, clean of -Wall or hlint warnings:

import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString, URLString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ sort $ concat repourls
          mapM_ shellToGit gitURLs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1::Int)..]

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

getRepos :: String -> IO [String]
getRepos = fmap repos . openURL

gitify :: String -> String
gitify x = "https://github.com" ++ x ++ ".git"

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, dropExtension $ drop 19 u] ""
                  print y

This, or a version of it, works well. But I caution people against misusing it! There are a lot of repositories on GitHub; please don’t go running this carelessly. It will pull down 4-12 gigabytes of data. GitHub is a good FLOSS-friendly business by all accounts, and doesn’t deserve people wasting its bandwidth & money if they are not even going to keep what they downloaded.

The script golfed

For kicks, let’s see what a shorter, more unmaintainable and unreadable, version looks like (in the best scripting-language tradition):

import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let i = take (lastPage $ linkify html) $
                   map (("https://github.com/languages/Haskell/created?page="++) . show) [1..]
          repourls <- mapM (fmap (uniq . linkify) . openURL) i
          mapM_ shellToGit $ map (\x -> "https://github.com" ++ x ++ ".git") $ sort $ concat repourls
       where openURL target = fmap snd $ curlGetString target []
             linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
             lastPage = maximum . map (read . drop 32) .
                         filter ("/languages/Haskell/created?page=" `isPrefixOf`)
             uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x)
                     ["/languages/", "/login/", "/site/", "http://", "https://"])
             count x = length (elemIndices '/' x) == 2
             shellToGit u = do { (_,y,_) <- readProcessWithExitCode "git"
                                             ["clone", u, dropExtension $ drop 19 u] ""; print y }

14 lines of code isn’t too bad, especially considering that Haskell is not usually considered a language suited for scripting or scraping purposes like this. Nor do I see any obvious missing abstractions—count is a function that might be useful in Data.List, and openURL is something that the Curl binding could provide on its own, but everything else looks pretty necessary.

Exercises for the reader

  1. Once one has all those repositories, how does one keep them up to date? The relevant command is git pull. How would one run this on all the repositories? In shell script? Using find? From a crontab?8
  2. In the previous script, a number of short-cuts were taken which render it Haskell-specific. Identify and remedy them, turning this script into a general-purpose script for downloading any language for which GitHub has a category. (The language name can be accessed by reading an argument to the script with standard functions like getArgs; a possible skeleton follows below.)
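A minimal skeleton of the argument handling for the second exercise, assuming the language name arrives as the first command-line argument; deriving the page prefix and the amount to drop from lang, rather than hardwiring "Haskell" and drop 32, is the actual work:

import System.Environment (getArgs)

main :: IO ()
main = do (lang:_) <- getArgs
          let index = "https://github.com/languages/" ++ lang ++ "/created"
          -- proceed as before, but build the "?page=" prefix (and its length)
          -- from lang instead of hardwiring the Haskell constants
          putStrLn index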

  1. This page assumes a basic understanding of how version control programs work, Haskell syntax, and the Haskell standard library. For those not au courant, repositories are basically a collection of logically related files and a detailed history of the modifications that built them up from nothing.↩︎

  2. From my initial email:

    I’d think it [Control.Monad.void] [would] be useful for more than just me. Agda is lousy with calls to >> return (); and then there’s ZMachine, arrayref, whim, the barracuda packages, binary, bnfc, buddha, bytestring, c2hs, cabal, chesslibrary, comas, conjure, curl, darcs, darcs-benchmark, dbus-haskell, ddc, dephd, derive, dhs, drift, easyvision, ehc, filestore, folkung, geni, geordi, gtk2hs, gnuplot, ginsu, halfs, happstack, haskeline, hback, hbeat… You get the picture.

    ↩︎
  3. I occasionally also use Haskell scripts based on haskell-src-exts.↩︎

  4. Besides curl, there is the http-wget wrapper, and the http-enumerator package claims to natively support HTTPS. I have not tried them.↩︎

  5. What would we do? Keep retrying? There are going to be tons of errors in this script anyway (from repositories incorrectly identified as Haskell-related, to repository duplicates, to transient network errors), so we would gain a great deal of complexity from retrying and might make the script less reliable.↩︎

  6. Oh no—a blacklist! This should make us unhappy, because as computer security has taught us, blacklists fall out of date quickly or were never correct to begin with. Much better to whitelist, but how can we do that? People could name their repositories any damn thing, and pick any accursed username; for all our code knows, ‘/site/terms’ is a repository—perhaps the user ‘site’ maintains some sort of natural language library called ‘terms’.↩︎

  7. A quick point: in the previous scripts, I went to some effort to get greater parallelism, but in this case, we don’t want to hammer GitHub with a few thousand simultaneous git clone invocations; Haskell repositories are created rarely enough that we can afford to be friendly and only download one repository at a time.↩︎

  8. Example cron answer: @weekly find ~/bin -type d -name ".git" -execdir nice git pull \;↩︎