Archiving GitHub

Scraping and downloading Haskell-related repositories from GitHub
Haskell, archiving, tutorial
2011-03-20–2013-10-28 finished certainty: highly likely importance: 4


This Haskell tutorial was written in early March 2011, and while the code below worked then, it may not work with GitHub or the necessary Haskell libraries now. If you are interested in downloading from GitHub, I suggest looking into the GitHub API or the Haskell github library.

Why download?

Along the lines of my other archiving efforts, I like to keep copies of Haskell-related source-code1 because the files & history might come in handy on occasion, and because having a large collection of repositories lets me search them for random purposes.

(For example, part of my lobbying for Control.Monad.void2 was based on producing a list of dozens of source files which rewrote that particular idiom, and I have been able to comment usefully based on crude statistics gathered3 through my hundreds of repositories.)

Previously I wrote a simple script to download the repositories of the source-repository hosting site Patch-tag.com. Patch-tag specializes in hosting Darcs repositories (usually Haskell-related). GitHub is a much larger & more popular hosting site, and though it supports not Darcs but Git (as the name indicates), it is so popular that it still hosts a great deal of Haskell. I’ve downloaded a few repositories out of curiosity or because I was working on the contents of the repository (eg. gitit), but there are too many to download manually. I needed a script.

Patch-tag was nice enough to supply a URL which provided exactly the URLs I needed, but I couldn’t expect such personalized support from GitHub. GitHub does supply an API of sorts for developers and hobbyists, but said API provides no obvious way to get what I want: ‘URLs for all Haskell-related repos’. So, scraping it is—I’d write a script to munge some GitHub HTML and get the URLs I want that way.

Archiving GitHub

Parsing pages

The closest I can get to a target URL is https://github.com/languages/Haskell/created. We’ll be parsing that. The first thing to do is to steal TagSoup code from my previous scrapers, so our very crudest version looks like this:

import Text.HTML.TagSoup
import Text.HTML.Download (openURL)

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

Downloading pages (the lazy way)

We run it and it throws an exception! *** Exception: getAddrInfo: does not exist (No address associated with hostname)

Oops. We got so wrapped up in parsing the HTML that we forgot to make sure that downloading worked in the first place. Well, we’re lazy programmers, so now, on demand, we’ll investigate that problem. The exception thrown sounds like a problem with the openURL call—‘hostname’ is a networking term, not a parsing or printing term. So we try running just openURL "https://github.com/languages/Haskell/created"—same error. Not helpful.

We try a different implementation of openURL, mentioned in the other scraping script:

import Network.HTTP (getRequest, simpleHTTP)
openURL = simpleHTTP . getRequest

Calling that again, we see:

> openURL "https://github.com/languages/Haskell/created"
Loading package HTTP-4000.1.1 ... linking ... done.
Right HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.67
Date: Tue, 15 Mar 2011 22:54:31 GMT
Content-Type: text/html
Content-Length: 185
Connection: close
Location: https://github.com/languages/Haskell/created

Oh dear. It seems that the HTTP package just won’t handle HTTPS; the package description doesn’t mention HTTPS, nor do any of the module names seem connected to it. Best to give up on it entirely.

If we google ‘Haskell https’, one of the first 20 hits happens to be a question/page which sounds promising: “Haskell Network.Browser HTTPS Connection”. The one answer says to simply use the Haskell binding to curl. Well, fine. I already had that installed because Darcs uses the binding for downloads. I’ll use that package.4

We go to the top-level module, Network.Curl, hoping for an easy download. Scrolling down, one’s eye is caught by curlGetString, which, while not necessarily a promising name, does have an interesting type: URLString -> [CurlOption] -> IO (CurlCode, String).

Note especially the return value—from past experience with the HTTP package, one would give a good chance that URLString is just a type synonym for a URL string and the String return is just the HTML source we want. What CurlOption might be, I have no idea, but let’s try simply omitting them all. So we load the module in GHCi (:module + Network.Curl) and see what curlGetString "https://github.com/languages/Haskell/created" [] does:

(CurlOK,"<!DOCTYPE html>
<html>
  <head>
    <meta charset='utf-8'>
    <meta http-equiv=\"X-UA-Compatible\" content=\"chrome=1\">
        <title>Recently Created Haskell Repositories - GitHub</title>
    <link rel=\"search\" type=\"application/opensearchdescription+xml\"
                          href=\"/opensearch.xml\" title=\"GitHub\" />
    <link rel=\"fluid-icon\" href=\"https://github.com/fluidicon.png\" title=\"GitHub\" />
  ...")

Great! As they say, ‘try the simplest possible thing that could possibly work’, and this seems to. We don’t really care about the exit code, since this is a hacky script5; we’ll throw it away and only keep the second part of the tuple with the usual snd. It’s in IO, so we need to use liftM or fmap before we can apply snd. Combined with our previous TagSoup code, we get:

import Text.HTML.TagSoup
import Network.Curl (curlGetString, URLString)

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

Spidering (the lazy way)

What’s the output of this?

["logo boring","https://github.com","/plans","/explore","/features","/blog",
    "/login?return_to=https://github.com/languages/Haskell/created",
    "/languages/Haskell","/explore","explore_main","/repositories","explore_repos","/languages",
    "selected","explore_languages","/timeline","explore_timeline","/search","code_search",
    "/tips","explore_tips","/languages/Haskell","/languages/Haskell/created",
    "selected","/languages/Haskell/updated", "/languages","/languages/ActionScript/created",
    "/languages/Ada/created","/languages/Arc/created","/languages/ASP/created",
    "/languages/Assembly/created",
    "/languages/Boo/created","/languages/C/created","/languages/C%23/created",
    "/languages/C++/created","/languages/Clojure/created",
    "/languages/CoffeeScript/created",
    "/languages/ColdFusion/created","/languages/Common%20Lisp/created","/languages/D/created",
    "/languages/Delphi/created","/languages/Duby/created",
    "/languages/Eiffel/created",
    "/languages/Emacs%20Lisp/created","/languages/Erlang/created","/languages/F%23/created",
    "/languages/Factor/created","/languages/FORTRAN/created",
    "/languages/Go/created",
    "/languages/Groovy/created","/languages/HaXe/created","/languages/Io/created",
    "/languages/Java/created","/languages/JavaScript/created",
    "/languages/Lua/created",
    "/languages/Max/MSP/created","/languages/Nu/created","/languages/Objective-C/created",
    "/languages/Objective-J/created","/languages/OCaml/created",
    "/languages/ooc/created",
    "/languages/Perl/created","/languages/PHP/created","/languages/Pure%20Data/created",
    "/languages/Python/created","/languages/R/created",
    "/languages/Racket/created",
    "/languages/Ruby/created","/languages/Scala/created","/languages/Scheme/created",
    "/languages/sclang/created","/languages/Self/created",
    "/languages/Shell/created",
    "/languages/Smalltalk/created","/languages/SuperCollider/created","/languages/Tcl/created",
    "/languages/Vala/created","/languages/Verilog/created",
    "/languages/VHDL/created",
    "/languages/VimL/created","/languages/Visual%20Basic/created","/languages/XQuery/created",
    "/brownnrl","/brownnrl/Real-World-Haskell","/joh",
    "/joh/tribot","/bjornbm","/bjornbm/publicstuff","/codemac","/codemac/yi","/poconnell93",
    "/poconnell93/chat","/jillianfu","/jillianfu/Angel","/jaspervdj","/jaspervdj/sup-host","/serras",
    "/serras/scion-ghc-7-requisites","/serras","/serras/scion","/iand675","/iand675/cgen","/shangaslammi",
    "/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace","/jaspervdj","/jaspervdj/wol","/tomlokhorst",
    "/tomlokhorst/wol","/bos","/bos/concurrent-barrier","/jkingry","/jkingry/projectEuler","/olshanskydr",
    "/olshanskydr/xml-enumerator","/lorenz","/lorenz/fypmaincode","/jaspervdj",
    "/jaspervdj/data-object-json","/jaspervdj","/jaspervdj/data-object-yaml",
    "/languages/Haskell/created?page=2","next","/languages/Haskell/created?page=3",
    "/languages/Haskell/created?page=4","/languages/Haskell/created?page=5",
    "/languages/Haskell/created?page=6","/languages/Haskell/created?page=7",
    "/languages/Haskell/created?page=8","/languages/Haskell/created?page=9",
    "/languages/Haskell/created?page=208","/languages/Haskell/created?page=209",
    "/languages/Haskell/created?page=2","l","next","http://www.rackspace.com","logo",
    "http://www.rackspace.com ","http://www.rackspacecloud.com","https://github.com/blog",
    "/login/multipass?to=http%3A%2F%2Fsupport.github.com","https://github.com/training",
    "http://jobs.github.com","http://shop.github.com",
    "https://github.com/contact","http://develop.github.com","http://status.github.com",
    "/site/terms","/site/privacy","https://github.com/security",
    "nofollow","?locale=de","nofollow","?locale=fr","nofollow","?locale=ja",
    "nofollow","?locale=pt-BR","nofollow","?locale=ru","nofollow","?locale=zh",
    "#","minibutton btn-forward js-all-locales","nofollow","?locale=en","nofollow","?locale=af",
    "nofollow","?locale=ca","nofollow","?locale=cs","nofollow","?locale=de",
    "nofollow","?locale=es","nofollow","?locale=fr","nofollow","?locale=hr",
    "nofollow","?locale=hu","nofollow","?locale=id","nofollow","?locale=it",
    "nofollow","?locale=ja","nofollow","?locale=nl","nofollow","?locale=no",
    "nofollow","?locale=pl","nofollow","?locale=pt-BR","nofollow","?locale=ru",
    "nofollow","?locale=sr","nofollow","?locale=sv","nofollow","?locale=zh",
    "#","js-see-all-keyboard-shortcuts"]

Quite a mouthful, but we can easily filter things down. “/languages/Haskell/created?page=3” is an example of a link to the next page listing Haskell repositories; presumably the current page would be “?page=1”, and the highest listed seems to be “/languages/Haskell/created?page=209”. The actual repositories look like “/jaspervdj/data-object-yaml”.

The regularity of the “created” numbering suggests that we can avoid any actual spidering. Instead, we could just figure out the last, highest page and then generate all the page names in between, because they follow a simple scheme.

Assume we have the final number, n: we already know we get the full list of page numbers through [1..n]; then we want to prepend “languages/Haskell/created?page=”, but it’s a type error to simply write map ("languages/Haskell/created?page="++) [1..n]. There is only one type-variable in (++) :: [a] -> [a] -> [a], so we can’t append Integers to a String directly. To convert the Integers to proper Strings, we do map show, and that gives us our generator:

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [1..]

(This will throw a warning under -Wall because GHC has to guess whether the 1 is an Int or Integer. This can be quieted by writing (1::Int) instead.)

But what is n? We don’t know the final, highest, oldest page, so we don’t know how much of our infinite lazy list to take. It’s easy enough to filter the scraped list down to just the index links: filter (isPrefixOf "/languages/Haskell/created?page=").

Then we call last, right? (Or something like head . reverse, if we didn’t know last or if we didn’t think to check the Hoogle hits for [a] -> a.) But if you look back at the original scraping output, you see an example of how a simple approach can go wrong: we read “/languages/Haskell/created?page=209” and then we read “/languages/Haskell/created?page=2”! 2 is less than 209, of course, and is the wrong answer. GitHub is not padding the numbers to look like “created?page=002”, so our simple-minded approach doesn’t work.
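
We can see the pitfall directly if we try it on the scraped links (a hypothetical GHCi session, with links bound to the linkify output above):

> last $ filter ("/languages/Haskell/created?page=" `isPrefixOf`) links
"/languages/Haskell/created?page=2"

The last match is the trailing ‘next’ link, not page 209.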

So we need to extract the number. Easy enough: the prefix is statically known and never changes, so we can hardwire some crude parsing using drop: drop 32. How to turn the remaining String into an Int? Hopefully one knows about read, but even here Hoogle will save our bacon if we think to look through the list of hits for String -> Int—read turns up as hit #10 or #11. Now that we have turned our [String] into [Int], we could sort it and take the last entry, or again go to the standard library and use maximum (like read, it will turn up for [Int] -> Int, if not as highly ranked as one might hope). Tweaking the syntax a little, our final result is:

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)
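
With the numbers extracted, maximum does the right thing where a lexicographic last went wrong (a hypothetical GHCi session):

> lastPage ["/languages/Haskell/created?page=209", "/languages/Haskell/created?page=2"]
209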

If we didn’t want to hardwire this for Haskell, we’d probably write the function with an additional parameter for the prefix and replace the 32 with a runtime calculation of what to remove:

lastPageGeneric :: String -> [String] -> Int
lastPageGeneric prefix = maximum . map (read . drop (length prefix)) . filter (prefix `isPrefixOf`)

So let’s put what we have together. The program can download an initial index page, parse it, find the number of the last index page, generate the URLs of all index pages, and print those out (to prove that it all works):

import Data.List (isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          print indxPgs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1::Int)..]

So where were we? We had a [String] (in a variable named indxPgs) which represents all the index pages. We can get the HTML source of each page just by reusing openURL (it works on the first one, so it stands to reason it’d work on all index pages), which is trivial by this point: mapM openURL indxPgs.
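
The types line up as we’d hope (a hypothetical GHCi check, with the openURL defined above in scope):

> :type mapM openURL
mapM openURL :: [URLString] -> IO [String]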

Filtering repositories

In the TagSoup result, we saw the addresses of the repositories listed on the first index page:

["/brownnrl","/brownnrl/Real-World-Haskell","/joh","/joh/tribot","/bjornbm","/bjornbm/publicstuff",
    "/codemac","/codemac/yi","/poconnell93","/poconnell93/chat","/jillianfu","/jillianfu/Angel",
    "/jaspervdj","/jaspervdj/sup-host","/serras","/serras/scion-ghc-7-requisites","/serras","/serras/scion",
    "/iand675","/iand675/cgen","/shangaslammi","/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace",
    "/jaspervdj","/jaspervdj/wol","/tomlokhorst","/tomlokhorst/wol","/bos","/bos/concurrent-barrier",
    "/jkingry","/jkingry/projectEuler","/olshanskydr","/olshanskydr/xml-enumerator","/lorenz",
    "/lorenz/fypmaincode","/jaspervdj","/jaspervdj/data-object-json","/jaspervdj",
    "/jaspervdj/data-object-yaml"]

Without looking at the rendered page in our browser, it’s obvious that GitHub is linking first to whatever user owns or created the repository, and then linking to the repository itself. We don’t want the users, but the repositories. Fortunately, it’s equally obvious how to tell them apart: no user page has two forward-slashes in it, while all repository pages do.

So we want to count the forward-slashes and keep every address with exactly 2 forward-slashes. The type of our function takes a possible entry and a list, and returns a count. This is easy to do with primitive recursion and an accumulator, or perhaps length combined with filter; but the base library already has functions of type a -> [a] -> Int. elemIndex annoyingly returns a Maybe Int, so we’ll use elemIndices instead and call length on its output: length (elemIndices '/' x) == 2.
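
Checking the predicate against examples from the scrape (a hypothetical GHCi session):

> length (elemIndices '/' "/jaspervdj/data-object-yaml") == 2
True
> length (elemIndices '/' "/jaspervdj") == 2
False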

This is not quite right. If we run this on the original parsed output, we get:

["https://github.com","/languages/Haskell","/languages/Haskell","/plategreaves/unordered-containers",
    "/vincenthz/hs-tls-extra","/aculich/fix-symbols-gitit",
    "/sphynx/euler-hs","/DRMacIver/unordered-containers","/hamishmack/yesod-slides","/GNUManiacs/hoppla",
    "/DRMacIver/hs-rank-aggregation","/naota/hackage-autoebuild",
    "/magthe/hsini","/dagit/gnuplot-test","/imbaczek/HBPoker","/sergeyastanin/simpleea","/cbaatz/hamu8080",
    "/aristidb/xml-enumerator",
    "/elliottt/value-supply","/gnumaniacs-org/hoppla","/emillon/tyson","/quelgar/hifive",
    "/quelgar/haskell-websockets","http://www.rackspace.com",
    "http://www.rackspace.com ","http://www.rackspacecloud.com",
    "/login/multipass?to=http%3A%2F%2Fsupport.github.com","http://jobs.github.com",
    "http://shop.github.com","http://develop.github.com",
    "http://status.github.com","/site/terms","/site/privacy"]

It doesn’t look like we mistakenly omitted a repository, but it does look like we mistakenly included things we should not have. We need to filter out anything beginning with “http://”, “https://”, “/site/”, “/languages/”, or “/login/”.6

We could call filter multiple times, or use a tricky foldr to accumulate only results which don’t match any of the items in our list ["/languages/", "/login/", "/site/", "http://", "https://"]. But I already wrote the solution to this problem back in the original WP RSS archive-bot, where I noticed that my original giant filter call could be replaced by a much more elegant use of any:

 where  uniq :: [String] -> [String]
        uniq = filter (\x -> not $ any (flip isInfixOf x) exceptions)

        exceptions :: [String]
        exceptions = ["wikimediafoundation", "http://www.mediawiki.org/", "wikipedia",
                      "&curid=", "index.php?title=", "&action="]

In our case, we replace isInfixOf with isPrefixOf, and we have different constants defined in exceptions. To put it all together into a new filtering function, we have:

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

Our new minimalist program, which will test out repos:

import Data.List (elemIndices, isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          print $ repos html

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

The output:

["/plategreaves/unordered-containers","/vincenthz/hs-tls-extra","/aculich/fix-symbols-gitit",
    "/sphynx/euler-hs","/DRMacIver/unordered-containers","/hamishmack/yesod-slides",
    "/GNUManiacs/hoppla","/DRMacIver/hs-rank-aggregation","/naota/hackage-autoebuild","/magthe/hsini",
    "/dagit/gnuplot-test","/imbaczek/HBPoker",
    "/sergeyastanin/simpleea","/cbaatz/hamu808a0","/aristidb/xml-enumerator","/elliottt/value-supply",
    "/gnumaniacs-org/hoppla","/emillon/tyson",
    "/quelgar/hifive","/quelgar/haskell-websockets"]

Shelling out to git

That leaves the ‘shell out to git’ functionality. We could try stealing the spawn (call out to /bin/sh) code from XMonad, but the point of spawn is that the spawned process forks away completely from our script, which will completely screw up our desired lack of parallelism.7 I ultimately wound up using a function from System.Process, readProcessWithExitCode. (Why readProcessWithExitCode and not readProcess? Because if a directory already exists, git/readProcess throws an exception which kills the script!) This will work:

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u] ""
                  print y

In retrospect, it might have been a better idea to try to use runCommand or System.Cmd. Alternatively, we could use the same shelling-out functionality from the original patch-tag.com script:

mapM_ (\x -> runProcess "darcs" ["get", "--lazy", "http://patch-tag.com"++x]
                          Nothing Nothing Nothing Nothing Nothing) targets

Which could be rewritten for us (sans logging) as:

shellToGit :: String -> IO ()
shellToGit u = runProcess "git" ["clone", u] Nothing Nothing Nothing Nothing Nothing >> return ()
-- We could replace `return ()` with Control.Monad.void to drop `IO ProcessHandle` result

Now it’s easy to fill in the missing lines of main:

          ...
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ concat repourls
          mapM_ shellToGit gitURLs

(The concat is there because getRepos gave us a [String] for each String, and then we ran it on a [String]—so our result is [[String]]! But we don’t care about preserving the information about where each String came from, so we smush it down to a single list. Strictly speaking, we didn’t need to do print y in shellToGit, but while developing, it’s a good idea to have some sort of logging, to get a sense of what the script is doing. And once you are printing at all, you can sort the list of repository URLs to download them in order by user.)
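
To illustrate the flattening step (a hypothetical GHCi session, with made-up per-page results):

> concat [["/bos/concurrent-barrier"], ["/magthe/hsini", "/dagit/gnuplot-test"]]
["/bos/concurrent-barrier","/magthe/hsini","/dagit/gnuplot-test"]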

Unique repositories

There is one subtlety here worth noting that our script is running roughshod over. Each URL we download is unique, because usernames are unique on GitHub and each URL is formed from a “/username/reponame” pair. But each downloaded repository is not unique, because git will shuck off the username and create a directory with just the repository name—“/john/bar” and “/jack/bar” will clash, and if you download in that order, the bar repository will be John’s repository and not Jack’s. Git will error out the second time, but this error is ignored by the shelling code. The solution would be to tell git to clone to a non-default but unique directory (for example, one could reuse the “/username/reponame”, and then one’s target directory would be neatly populated by several hundred directories named after users, each populated by a few repositories with non-unique names). If we went with the per-user approach, our new version would look like this:

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, drop 19 u] ""
                  print y

Why the drop 19 u? Well, u is the fully qualified URL, eg. “https://github.com/sergeyastanin/simpleea”. Obviously we don’t want to execute git clone "https://github.com/sergeyastanin/simpleea" "https://github.com/sergeyastanin/simpleea" (even though that’d be perfectly valid), because it makes for ugly folders. But drop 19 "https://github.com/sergeyastanin/simpleea" turns into “sergeyastanin/simpleea”, giving us the right local directory name with no prefixed slash.

Or you just pass in the original “/username/reponame” and use drop 1 on that instead. (Either way, you need to do additional work. Might as well just use drop 19.)

One final note: many of the URLs end in “.git”. If we disliked this, then we could enhance the drop 19 with System.FilePath.dropExtension: dropExtension $ drop 19 u.
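
A quick sanity check of the combination (a hypothetical GHCi session):

> dropExtension (drop 19 "https://github.com/sergeyastanin/simpleea.git")
"sergeyastanin/simpleea"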

The script

The final program, clean of -Wall or hlint warnings:

import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString, URLString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ sort $ concat repourls
          mapM_ shellToGit gitURLs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1::Int)..]

repos :: String -> [String]
repos = uniq . linkify
  where  uniq :: [String] -> [String]
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)
         exceptions :: [String]
         exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]
         count :: String -> Bool
         count x = length (elemIndices '/' x) == 2

getRepos :: String -> IO [String]
getRepos = fmap repos . openURL

gitify :: String -> String
gitify x = "https://github.com" ++ x ++ ".git"

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, dropExtension $ drop 19 u] ""
                  print y

This, or a version of it, works well. But I caution people against misusing it! There are a lot of repositories on GitHub; please don’t go running this carelessly. It will pull down 4–12 gigabytes of data. GitHub is a good FLOSS-friendly business by all accounts, and doesn’t deserve people wasting its bandwidth & money if they are not even going to keep what they downloaded.

The script golfed

For kicks, let’s see what a shorter, more unmaintainable and unreadable, version looks like (in the best scripting-language tradition):

import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let i = take (lastPage $ linkify html) $
                   map (("https://github.com/languages/Haskell/created?page="++) . show) [1..]
          repourls <- mapM (fmap (uniq . linkify) . openURL) i
          mapM_ shellToGit $ map (\x -> "https://github.com" ++ x ++ ".git") $ sort $ concat repourls
       where openURL target = fmap snd $ curlGetString target []
             linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
             lastPage = maximum . map (read . drop 32) .
                         filter ("/languages/Haskell/created?page=" `isPrefixOf`)
             uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x)
                     ["/languages/", "/login/", "/site/", "http://", "https://"])
             count x = length (elemIndices '/' x) == 2
             shellToGit u = do { (_,y,_) <- readProcessWithExitCode "git"
                                             ["clone", u, dropExtension $ drop 19 u] ""; print y }

14 lines of code isn’t too bad, especially considering that Haskell is not usually considered a language suited for scripting or scraping purposes like this. Nor do I see any obvious missing abstractions—count is a function that might be useful in Data.List, and openURL is something that the Curl binding could provide on its own, but everything else looks pretty necessary.

Exercises for the reader

  1. Once one has all those repositories, how does one keep them up to date? The relevant command is git pull. How would one run this on all the repositories? In shell script? Using find? From a crontab?8
  2. In the previous script, a number of short-cuts were taken which render it Haskell-specific. Identify and remedy them, turning this script into a general-purpose script for downloading any language for which GitHub has a category. (The language name can be accessed by reading an argument to the script with standard functions like getArgs; a minimal sketch follows.)
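
A minimal sketch of where exercise 2 might start, assuming the openURL, linkify, and lastPageGeneric definitions from above are in scope, and that GitHub’s URL scheme is uniform across languages:

import System.Environment (getArgs)

main :: IO ()
main = do (lang:_) <- getArgs   -- eg. run as: ./archive-github Scheme
          let prefix = "/languages/" ++ lang ++ "/created?page="
          html <- openURL ("https://github.com/languages/" ++ lang ++ "/created")
          -- find the last index page for this language instead of hardwiring Haskell
          print (lastPageGeneric prefix (linkify html))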

  1. This page assumes a basic understanding of how version control programs work, Haskell syntax, and the Haskell standard library. For those not au courant, repositories are basically a collection of logically related files and a detailed history of the modifications that built them up from nothing.↩︎

  2. From my initial email:

    I’d think it [Control.Monad.void] [would] be useful for more than just me. Agda is lousy with calls to >> return (); and then there’s ZMachine, arrayref, whim, the barracuda packages, binary, bnfc, buddha, bytestring, c2hs, cabal, chesslibrary, comas, conjure, curl, darcs, darcs-benchmark, dbus-haskell, ddc, dephd, derive, dhs, drift, easyvision, ehc, filestore, folkung, geni, geordi, gtk2hs, gnuplot, ginsu, halfs, happstack, haskeline, hback, hbeat… You get the picture.

    ↩︎
  3. I occasionally also use Haskell scripts based on haskell-src-exts.↩︎

  4. Besides curl, there is the http-wget wrapper, and the http-enumerator package claims to natively support HTTPS. I have not tried them.↩︎

  5. What would we do? Keep retrying? There are going to be tons of errors in this script anyway, from repositories incorrectly identified as Haskell-related to repository duplicates to transient network errors; retrying would add a great deal of complexity and might make the script less reliable.↩︎

  6. Oh no—a blacklist! This should make us unhappy, because as computer security has taught us, blacklists fall out of date quickly or were never correct to begin with. Much better to whitelist, but how can we do that? People could name their repositories any damn thing, and pick any accursed username; for all our code knows, ‘/site/terms’ is a repository—perhaps the user ‘site’ maintains some sort of natural-language library called ‘terms’.↩︎

  7. A quick point: in the previous scripts, I went to some effort to get greater parallelism, but in this case, we don’t want to hammer GitHub with a few thousand simultaneous git clone invocations; Haskell repositories are created rarely enough that we can afford to be friendly and only download one repository at a time.↩︎

  8. Example cron answer: @weekly find ~/bin -type d -name ".git" -execdir nice git pull \;↩︎