---
description: Scraping and downloading Haskell-related repositories from GitHub
tags: Haskell, archiving
created: 20 Mar 2011
status: finished
confidence: highly likely
importance: 4
...

> This Haskell tutorial was written in early March 2011, and while the code below worked then, it may not work with GitHub or the necessary Haskell libraries now. If you are interested in downloading from GitHub, I suggest looking into the [GitHub API](#comment-1099296212) or the Haskell [`github`](https://github.com/fpco/github) library.

# Why download?

Along the lines of [Archiving URLs](/Archiving-URLs), I like to keep copies of [Haskell](!Wikipedia "Haskell (programming language)")-related source-code [repositories](!Wikipedia "Revision control")^[This page assumes a basic understanding of how version control programs work, Haskell syntax, and the Haskell standard library. For those not _au courant_, repositories are basically a collection of logically related files and a detailed history of the modifications that built them up from nothing.] because the files & history might come in handy on occasion, and because having a large collection of repositories lets me search them for random purposes. (For example, part of my lobbying for [Control.Monad.void](!Hoogle)[^void] was based on producing a list of dozens of source files which rewrote that particular idiom, and I have been able [to](http://web.archive.org/web/20130623103613/http://www.haskell.org/pipermail/libraries/2011-April/016288.html "Proposal: add Int indexing functions to Data.Set") [usefully](http://web.archive.org/web/20130128084734/http://www.haskell.org/pipermail/haskell-cafe/2011-February/089614.html "[Haskell-cafe] ANN: unordered-containers - a new, faster hashing-based containers library") [comment](http://web.archive.org/web/20130128084821/http://www.haskell.org/pipermail/haskell-cafe/2008-January/038948.html "[Haskell-cafe] RE: [Haskell] Announce: Yi 0.3") [&](http://web.archive.org/web/20130128084824/http://www.haskell.org/pipermail/haskell-cafe/2011-May/091660.html "[Haskell-cafe] Usage of rewrite rule specialization in Hackage") [judge](/haskell/Summer-of-Code) based on crude [statistics](http://web.archive.org/web/20130128084846/http://www.haskell.org/pipermail/haskell-cafe/2011-May/091663.html "[Haskell-cafe] Haskell statistics for use of TypeFamilies and FunctionalDependencies") gathered by [grepping](!Wikipedia)^[I occasionally also use Haskell scripts based on [haskell-src-exts](!Hackage).] through my hundreds of repositories.)

[^void]: From my [initial email](http://web.archive.org/web/20130128084740/http://www.haskell.org/pipermail/libraries/2009-June/011880.html "Adding an ignore function to Control.Monad"):

    > I'd think it [`Control.Monad.void`] [would] be useful for more than just me. Agda is lousy with calls to `>> return ()`; and then there's ZMachine, arrayref, whim, the barracuda packages, binary, bnfc, buddha, bytestring, c2hs, cabal, chesslibrary, comas, conjure, curl, darcs, darcs-benchmark, dbus-haskell, ddc, dephd, derive, dhs, drift, easyvision, ehc, filestore, folkung, geni, geordi, gtk2hs, gnuplot, ginsu, halfs, happstack, haskeline, hback, hbeat... You get the picture.

[Previously](http://blog.patch-tag.com/2010/03/13/mirroring-patch-tag/) I wrote a simple script to download the repositories of the source repository hosting site [Patch-tag.com](http://www.patch-tag.com/). Patch-tag specializes in hosting [Darcs](!Wikipedia) repositories (usually Haskell-related).
[GitHub](!Wikipedia) is a much larger & more popular hosting site, and though it supports not Darcs but [git](!Wikipedia "Git (software)") (as the name indicates), it is so popular that it still hosts a great deal of Haskell. I've downloaded a few repositories out of curiosity or because I was working on the contents of the repository (eg. [gitit](https://github.com/jgm/gitit)), but there are too many to download manually. I needed a script.

Patch-tag was nice enough to supply a URL which provided exactly the URLs I needed, but I couldn't expect such personalized support from GitHub. GitHub does supply an API of sorts for developers and hobbyists, but said API provides no obvious way to get what I want: 'URLs for all Haskell-related repos'. So, [scraping](!Wikipedia "Data scraping") it is - I'd write a script to munge some GitHub HTML and get the URLs I want *that* way.

# Archiving GitHub

## Parsing pages

The closest I can get to a target URL is <https://github.com/languages/Haskell/created>. We'll be parsing that. The first thing to do is to steal [TagSoup](http://community.haskell.org/~ndm/tagsoup/) code from my [previous scrapers](/haskell/Wikipedia-Archive-Bot#parsing-html), so our very crudest version looks like this:

~~~{.Haskell}
import Text.HTML.TagSoup
import Text.HTML.Download (openURL)

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links

linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
~~~

## Downloading pages (the lazy way)

We run it and it throws an exception! `*** Exception: getAddrInfo: does not exist (No address associated with hostname)`

Oops. We got so wrapped up in parsing the HTML that we forgot to make sure downloading worked in the first place. Well, we're lazy programmers, so now, on demand, we'll investigate that problem. The exception sounds like a problem with the `openURL` call - '[hostname](!Wikipedia)' is a networking term, not a parsing or printing term. So we try running just `openURL "https://github.com/languages/Haskell/created"` - same error. Not helpful. We try a different implementation of `openURL`, mentioned in the [other scraping script](/haskell/Wikipedia-RSS-Archive-Bot#rewriting-network-code):

~~~{.Haskell}
import Network.HTTP (getRequest, simpleHTTP)

openURL = simpleHTTP . getRequest
~~~

Calling that again, we see:

~~~
> openURL "https://github.com/languages/Haskell/created"
Loading package HTTP-4000.1.1 ... linking ... done.
Right HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.67
Date: Tue, 15 Mar 2011 22:54:31 GMT
Content-Type: text/html
Content-Length: 185
Connection: close
Location: https://github.com/languages/Haskell/created
~~~

Oh dear. It seems that the [`HTTP`](!Hackage "HTTP") package just won't handle HTTP*S*; the package description doesn't mention HTTPS, nor do any of the module names seem connected to it. Best to give up on it entirely. If we google 'Haskell https', one of the first 20 hits happens to be a [Stack Overflow](!Wikipedia) question which sounds promising: ["Haskell Network.Browser HTTPS Connection"](http://stackoverflow.com/questions/3988115/haskell-network-browser-https-connection). The [one answer](http://stackoverflow.com/questions/3988115/haskell-network-browser-https-connection/3988829#3988829) says to simply use the [Haskell binding](!Hackage "curl") to [curl](!Wikipedia "cURL"). Well, fine. I already had that installed because Darcs uses the binding for downloads.
I'll use that package.^[Besides `curl`, there is the [http-wget](!Hackage) wrapper, and the [http-enumerator](!Hackage) package claims to natively support HTTPS. I have not tried them.] We go to the [top-level module](http://hackage.haskell.org/packages/archive/curl/1.3.6/doc/html/Network-Curl.html) hoping for an easy download. Scrolling down, one's eye is caught by [curlGetString](http://hackage.haskell.org/packages/archive/curl/1.3.6/doc/html/Network-Curl.html#v:curlGetString), which, while not necessarily a promising name, does have an interesting type: `URLString -> [CurlOption] -> IO (CurlCode, String)`. Note especially the return value - from past experience with the HTML package, one would bet that `URLString` is just a type synonym for a URL string and that the returned String is just the HTML source we want. What `CurlOption` might be, I have no idea, but let's try simply omitting them all. So we load the module in GHCi (`:module + Network.Curl`) and see what `curlGetString "https://github.com/languages/Haskell/created" []` does:

~~~{.Haskell}
(CurlOK," Recently Created Haskell Repositories - GitHub ...")
~~~

Great! As they say, 'try the simplest possible thing that could possibly work', and this seems to work. We don't really care about the exit code, since this is a hacky script^[What would we do? Keep retrying? There are going to be so many errors in this script anyway - from repositories incorrectly identified as Haskell-related to repository duplicates to transient network errors - that retrying would add a great deal of complexity and might make the script *less* reliable.]; we'll throw it away and only keep the second part of the tuple with the usual `snd`. The result is in IO, so we need `liftM` or `fmap` before we can apply `snd`. Combined with our previous TagSoup code, we get:

~~~{.Haskell}
import Text.HTML.TagSoup
import Network.Curl (curlGetString, URLString)

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let links = linkify html
          print links

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
~~~

## Spidering (the lazy way)

What's the output of this?
~~~{.Haskell}
["logo boring","https://github.com","/plans","/explore","/features","/blog",
"/login?return_to=https://github.com/languages/Haskell/created",
"/languages/Haskell","/explore","explore_main","/repositories","explore_repos","/languages",
"selected","explore_languages","/timeline","explore_timeline","/search","code_search",
"/tips","explore_tips","/languages/Haskell","/languages/Haskell/created",
"selected","/languages/Haskell/updated","/languages","/languages/ActionScript/created",
"/languages/Ada/created","/languages/Arc/created","/languages/ASP/created",
"/languages/Assembly/created","/languages/Boo/created","/languages/C/created","/languages/C%23/created",
"/languages/C++/created","/languages/Clojure/created","/languages/CoffeeScript/created",
"/languages/ColdFusion/created","/languages/Common%20Lisp/created","/languages/D/created",
"/languages/Delphi/created","/languages/Duby/created","/languages/Eiffel/created",
"/languages/Emacs%20Lisp/created","/languages/Erlang/created","/languages/F%23/created",
"/languages/Factor/created","/languages/FORTRAN/created","/languages/Go/created",
"/languages/Groovy/created","/languages/HaXe/created","/languages/Io/created",
"/languages/Java/created","/languages/JavaScript/created","/languages/Lua/created",
"/languages/Max/MSP/created","/languages/Nu/created","/languages/Objective-C/created",
"/languages/Objective-J/created","/languages/OCaml/created","/languages/ooc/created",
"/languages/Perl/created","/languages/PHP/created","/languages/Pure%20Data/created",
"/languages/Python/created","/languages/R/created","/languages/Racket/created",
"/languages/Ruby/created","/languages/Scala/created","/languages/Scheme/created",
"/languages/sclang/created","/languages/Self/created","/languages/Shell/created",
"/languages/Smalltalk/created","/languages/SuperCollider/created","/languages/Tcl/created",
"/languages/Vala/created","/languages/Verilog/created","/languages/VHDL/created",
"/languages/VimL/created","/languages/Visual%20Basic/created","/languages/XQuery/created",
"/brownnrl","/brownnrl/Real-World-Haskell","/joh",
"/joh/tribot","/bjornbm","/bjornbm/publicstuff","/codemac","/codemac/yi","/poconnell93",
"/poconnell93/chat","/jillianfu","/jillianfu/Angel","/jaspervdj","/jaspervdj/sup-host","/serras",
"/serras/scion-ghc-7-requisites","/serras","/serras/scion","/iand675","/iand675/cgen","/shangaslammi",
"/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace","/jaspervdj","/jaspervdj/wol","/tomlokhorst",
"/tomlokhorst/wol","/bos","/bos/concurrent-barrier","/jkingry","/jkingry/projectEuler","/olshanskydr",
"/olshanskydr/xml-enumerator","/lorenz","/lorenz/fypmaincode","/jaspervdj",
"/jaspervdj/data-object-json","/jaspervdj","/jaspervdj/data-object-yaml",
"/languages/Haskell/created?page=2","next","/languages/Haskell/created?page=3",
"/languages/Haskell/created?page=4","/languages/Haskell/created?page=5",
"/languages/Haskell/created?page=6","/languages/Haskell/created?page=7",
"/languages/Haskell/created?page=8","/languages/Haskell/created?page=9",
"/languages/Haskell/created?page=208","/languages/Haskell/created?page=209",
"/languages/Haskell/created?page=2","l","next","http://www.rackspace.com","logo",
"http://www.rackspace.com ","http://www.rackspacecloud.com","https://github.com/blog",
"/login/multipass?to=http%3A%2F%2Fsupport.github.com","https://github.com/training",
"http://jobs.github.com","http://shop.github.com",
"https://github.com/contact","http://develop.github.com","http://status.github.com",
"/site/terms","/site/privacy","https://github.com/security",
"nofollow","?locale=de","nofollow","?locale=fr","nofollow","?locale=ja",
"nofollow","?locale=pt-BR","nofollow","?locale=ru","nofollow","?locale=zh",
"#","minibutton btn-forward js-all-locales","nofollow","?locale=en","nofollow","?locale=af",
"nofollow","?locale=ca","nofollow","?locale=cs","nofollow","?locale=de",
"nofollow","?locale=es","nofollow","?locale=fr","nofollow","?locale=hr",
"nofollow","?locale=hu","nofollow","?locale=id","nofollow","?locale=it",
"nofollow","?locale=ja","nofollow","?locale=nl","nofollow","?locale=no",
"nofollow","?locale=pl","nofollow","?locale=pt-BR","nofollow","?locale=ru",
"nofollow","?locale=sr","nofollow","?locale=sv","nofollow","?locale=zh",
"#","js-see-all-keyboard-shortcuts"]
~~~

Quite a mouthful, but we can easily filter things down. "/languages/Haskell/created?page=3" is an example of a link to the next page listing Haskell repositories; presumably the current page would be "?page=1", and the highest listed seems to be "/languages/Haskell/created?page=209". The actual repositories look like "/jaspervdj/data-object-yaml".

The regularity of the "created" numbering suggests that we can avoid any actual spidering. Instead, we could just figure out what the last (highest-numbered) page is and then generate all the page names in between, because they follow a simple scheme. Assume we have the final number, _n_: we already know we can get the full list of page numbers through `[1..n]`; then we want to prepend "languages/Haskell/created?page=" to each, but it's a type error to simply write `map ("languages/Haskell/created?page="++) [1..n]` - there is only one type variable in `(++) :: [a] -> [a] -> [a]`, so we cannot append an Integer to a String. To convert the Integers to proper Strings, we `map show` over them, so that gives us our generator:

~~~{.Haskell}
listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [1..]
~~~

(This will throw a warning under `-Wall` because GHC has to guess whether the 1 is an Int or Integer. This can be quieted by writing `(1::Int)` instead.)

But what is _n_? We don't know the final, highest, oldest page, so we don't know how much of our infinite lazy list to `take`. It's easy enough to [filter](!Hoogle) the list to get only the index-page links: `filter (isPrefixOf "/languages/Haskell/created?page=")`. Then we call [last](!Hoogle), right? (Or something like `head . reverse` if we didn't know `last` or didn't think to check the hits for [[a] -> a](!Hoogle).) But if you look back at the original scraping output, you see an example of how a simple approach can go wrong: we read "/languages/Haskell/created?page=209" and then we read "/languages/Haskell/created?page=2"! 2 is less than 209, of course, and is the wrong answer. GitHub is not padding the numbers to look like "created?page=002", so our simple-minded approach doesn't work.

So we need to extract the number. Easy enough: the prefix is statically known and never changes, so we can hardwire some crude parsing using [drop](!Hoogle): `drop 32`. How to turn the remaining String into an Int? Hopefully one knows about [read](!Hoogle), but even here Hoogle will save our bacon if we think to look through the list of hits for [String -> Int](!Hoogle) - `read` turns up as hit #10 or #11. *Then*, now that we have turned our `[String]` into `[Int]`, we could sort it and take the last entry, or again go to the standard library and use [maximum](!Hoogle) (like `read`, it will turn up for [[Int] -> Int](!Hoogle), if not as highly ranked as one might hope).
Tweaking the syntax a little, our final result is:

~~~{.Haskell}
lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)
~~~

If we didn't want to hardwire this for Haskell, we'd probably write the function with an additional parameter and replace the hardwired `drop 32` with a runtime calculation of how much to remove:

~~~{.Haskell}
lastPageGeneric :: String -> [String] -> Int
lastPageGeneric lang = maximum . map (read . drop (length lang)) . filter (lang `isPrefixOf`)
~~~

So let's put together what we have. The program can download an initial index page, parse it, find the number of the last index page, generate the URLs of all the index pages, and print those out (to prove that it all works):

~~~{.Haskell}
import Data.List (isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          print indxPgs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1::Int)..]
~~~

So where were we? We had a `[String]` (in a variable named `indxPgs`) which represents all the index pages. We can get the HTML source of each page just by reusing `openURL` (it works on the first one, so it stands to reason it'd work on all index pages), which is trivial by this point: `mapM openURL indxPgs`.

## Filtering repositories

In the TagSoup result, we saw the addresses of the repositories listed on the first index page:

~~~{.Haskell}
["/brownnrl","/brownnrl/Real-World-Haskell","/joh","/joh/tribot","/bjornbm","/bjornbm/publicstuff",
"/codemac","/codemac/yi","/poconnell93","/poconnell93/chat","/jillianfu","/jillianfu/Angel",
"/jaspervdj","/jaspervdj/sup-host","/serras","/serras/scion-ghc-7-requisites","/serras","/serras/scion",
"/iand675","/iand675/cgen","/shangaslammi","/shangaslammi/haskeroids","/rukav","/rukav/ReplayTrace",
"/jaspervdj","/jaspervdj/wol","/tomlokhorst","/tomlokhorst/wol","/bos","/bos/concurrent-barrier",
"/jkingry","/jkingry/projectEuler","/olshanskydr","/olshanskydr/xml-enumerator","/lorenz",
"/lorenz/fypmaincode","/jaspervdj","/jaspervdj/data-object-json","/jaspervdj",
"/jaspervdj/data-object-yaml"]
~~~

Without looking at the rendered page in our browser, it's obvious that GitHub links first to whatever user owns or created the repository, and then to the repository itself. We don't want the users, but the repositories. Fortunately, it's equally obvious that no user link has two forward-slashes in it, while every repository link does. So we want to count the forward-slashes and keep every address with exactly 2 of them. The function we want takes a possible element and a list and returns a count. This is easy to do with primitive recursion and an accumulator, or perhaps `length` combined with `filter`; but the base library already has functions for [a -> [a] -> Int](!Hoogle). [elemIndex](!Hoogle) annoyingly returns a `Maybe Int`, so we'll use [elemIndices](!Hoogle) instead and call `length` on its output: `length (elemIndices '/' x) == 2`.
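As a quick aside (my own check, not from the original), the slash-counting predicate does separate user links from repository links on well-behaved input:

~~~{.Haskell}
import Data.List (elemIndices)

-- `isRepoLink` is a hypothetical name for the predicate described above.
isRepoLink :: String -> Bool
isRepoLink x = length (elemIndices '/' x) == 2

-- isRepoLink "/jaspervdj"                  == False  (user link: one slash)
-- isRepoLink "/jaspervdj/data-object-yaml" == True   (repository link: two slashes)
~~~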
This is not quite right. If we run this on the original parsed output, we get:

~~~{.Haskell}
["https://github.com","/languages/Haskell","/languages/Haskell","/plategreaves/unordered-containers",
"/vincenthz/hs-tls-extra","/aculich/fix-symbols-gitit",
"/sphynx/euler-hs","/DRMacIver/unordered-containers","/hamishmack/yesod-slides","/GNUManiacs/hoppla",
"/DRMacIver/hs-rank-aggregation","/naota/hackage-autoebuild",
"/magthe/hsini","/dagit/gnuplot-test","/imbaczek/HBPoker","/sergeyastanin/simpleea","/cbaatz/hamu8080",
"/aristidb/xml-enumerator",
"/elliottt/value-supply","/gnumaniacs-org/hoppla","/emillon/tyson","/quelgar/hifive",
"/quelgar/haskell-websockets","http://www.rackspace.com",
"http://www.rackspace.com ","http://www.rackspacecloud.com",
"/login/multipass?to=http%3A%2F%2Fsupport.github.com","http://jobs.github.com",
"http://shop.github.com","http://develop.github.com",
"http://status.github.com","/site/terms","/site/privacy"]
~~~

It doesn't look like we mistakenly omitted a repository, but it does look like we mistakenly included things we should not have. We need to filter out anything beginning with "http://", "https://", "/site/", "/languages/", or "/login/".^[Oh no - a blacklist! This should make us unhappy, because as computer security has taught us, blacklists fall out of date quickly or were never correct to begin with. Much better to whitelist, but how can we do that? People could name their repositories any damn thing, and pick any accursed username; for all our code knows, '/site/terms' *is* a repository - perhaps the user 'site' maintains some sort of natural-language library called 'terms'.]

We could call `filter` multiple times, or use a tricky `foldr` to accumulate only results which don't match any of the items in our list `["/languages/", "/login/", "/site/", "http://", "https://"]`. But I already wrote [the solution](/haskell/Wikipedia-RSS-Archive-Bot) to this problem back in the original WP RSS archive-bot, where I noticed that my [original giant `filter` call](/haskell/Wikipedia-Archive-Bot#duplicate-urls) could be replaced by a much more elegant use of [any](!Wikipedia):

~~~{.Haskell}
where uniq :: [String] -> [String]
      uniq = filter (\x -> not $ any (flip isInfixOf x) exceptions)

      exceptions :: [String]
      exceptions = ["wikimediafoundation", "http://www.mediawiki.org/", "wikipedia",
                    "&curid=", "index.php?title=", "&action="]
~~~

In our case, we replace `isInfixOf` with `isPrefixOf`, and we have different constants defined in `exceptions`. Putting it all together into a new filtering function, we have:

~~~{.Haskell}
repos :: String -> [String]
repos = uniq . linkify
  where uniq :: [String] -> [String]
        uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)

        exceptions :: [String]
        exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]

        count :: String -> Bool
        count x = length (elemIndices '/' x) == 2
~~~

Our new minimalist program, which will test out `repos`:

~~~{.Haskell}
import Data.List (elemIndices, isPrefixOf)
import Network.Curl (curlGetString, URLString)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          print $ repos html

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

repos :: String -> [String]
repos = uniq . linkify
  where uniq :: [String] -> [String]
        uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)

        exceptions :: [String]
        exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]

        count :: String -> Bool
        count x = length (elemIndices '/' x) == 2
~~~

The output:

~~~{.Haskell}
["/plategreaves/unordered-containers","/vincenthz/hs-tls-extra","/aculich/fix-symbols-gitit",
"/sphynx/euler-hs","/DRMacIver/unordered-containers","/hamishmack/yesod-slides",
"/GNUManiacs/hoppla","/DRMacIver/hs-rank-aggregation","/naota/hackage-autoebuild","/magthe/hsini",
"/dagit/gnuplot-test","/imbaczek/HBPoker",
"/sergeyastanin/simpleea","/cbaatz/hamu8080","/aristidb/xml-enumerator","/elliottt/value-supply",
"/gnumaniacs-org/hoppla","/emillon/tyson",
"/quelgar/hifive","/quelgar/haskell-websockets"]
~~~

## Transforming links (the lazy way)

Looks good! We can rest confident. We now have most of our toolkit. What we have left to do is to turn a repo's relative address into an absolute address we can pass to an invocation of git, figure out how to shell out to `git clone`, and then put it all together with some `mapM` plumbing. So here's how things go:

1. We get the very first index with `openURL` and save the delicious HTML contents
2. We parse that with `linkify`, and then our laboriously worked-out `lastPage` will tell us the # of the oldest index
3. With that knowledge, we know how many URLs to ask `listPages` for
4. Now we call `openURL` on all *those* URLs
5. Parse their results and extract, not the index pages, but the repository pages
6. Somehow turn them into git-comprehensible URLs
7. Somehow run git on all said URLs

Steps 1-5 are straightforward given what we already have:

~~~{.Haskell}
main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          -- we should split this clause out as a function like `getRepos`
          repourls <- mapM (fmap repos . openURL) indxPgs
          ???
          ???
~~~

Hopefully turning a relative repository URL into an absolute git URL will be a pure operation, so let's tentatively define that function as `gitify :: String -> String`. Let's look at a random repository page (which appeared in our example output as `/hamishmack/yesod-slides`), whose git repository is at... `https://github.com/hamishmack/yesod-slides.git`! What luck, there's an obvious transformation there:

~~~{.Haskell}
gitify :: String -> String
gitify x = "https://github.com" ++ x ++ ".git"
~~~

## Shelling out to git

That leaves the 'shell out to git' functionality. We could try stealing the [spawn](!Hoogle) (call out to `/bin/sh`) code from XMonad, but the point of `spawn` is that it [forks](!Wikipedia "Fork (operating system)") away completely from our script, which would completely screw up our desired *lack* of parallelism.^[A quick point: in the previous scripts, I went to some effort to get greater parallelism, but in this case, we don't want to hammer GitHub with a few thousand simultaneous `git clone` invocations; Haskell repositories are created rarely enough that we can afford to be friendly and only download one repository at a time.] I ultimately wound up using a function from [System.Process](!Hoogle), [readProcessWithExitCode](!Hoogle). (Why `readProcessWithExitCode` and not `readProcess`? Because if a directory already exists, git/`readProcess` throws an exception which kills the script!)
This will work:

~~~{.Haskell}
shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u] ""
                  print y
~~~

In retrospect, it might have been a better idea to try [`runCommand`](!Hoogle "runCommand") or [`System.Cmd`](!Hoogle "System.Cmd"). Alternatively, we could use the same shelling-out functionality from the original [patch-tag.com](http://patch-tag.com/r/tphyahoo/mirrorpatchtag/snapshot/current/content/pretty/patch-tag-mirror.hs) script:

~~~{.Haskell}
mapM_ (\x -> runProcess "darcs" ["get", "--lazy", "http://patch-tag.com"++x]
                        Nothing Nothing Nothing Nothing Nothing) targets
~~~

which could be rewritten for us (sans logging) as:

~~~{.Haskell}
shellToGit :: String -> IO ()
shellToGit u = runProcess "git" ["clone", u] Nothing Nothing Nothing Nothing Nothing
               >> return () -- we could use Control.Monad.void instead to drop the `IO ProcessHandle` result
~~~

Now it's easy to fill in our 2 missing lines:

~~~{.Haskell}
...
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ concat repourls
          mapM_ shellToGit gitURLs
~~~

(The `concat` is there because `getRepos` gave us a `[String]` for each `String`, and then we ran it on a `[String]` - so our result is `[[String]]`! But we don't care about preserving the information about which page each String came from, so we smush it down to a single list. Strictly speaking, we didn't need to do `print y` in `shellToGit`, but while developing, it's a good idea to have some sort of logging, to get a sense of what the script is doing. And once you are printing at all, you can `sort` the list of repository URLs to download them in order by user.)

### Unique repositories

There is one subtlety worth noting that our script is running roughshod over. Each URL we download is unique, because usernames are unique on GitHub and each URL is formed from a "/username/reponame" pair. But each *downloaded* repository is *not* unique, because git will shuck off the username and create a directory with just the repository name - "/john/bar" and "/jack/bar" will clash, and if you download in that order, the local `bar` repository will be John's and not Jack's. Git will error out the second time, but that error is ignored by the shelling code. The solution is to tell git to `clone` to a non-default but unique directory (for example, one could reuse the "/username/reponame" pair, and then one's target directory would be neatly populated by several hundred directories named after users, each populated by a few repositories whose names no longer need to be unique). If we went with the per-user approach, our new version would look like this:

~~~{.Haskell}
shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, drop 19 u] ""
                  print y
~~~

Why the `drop 19 u`? Well, `u` is the fully-qualified URL, eg. "https://github.com/sergeyastanin/simpleea". Obviously we don't want to execute `git clone "https://github.com/sergeyastanin/simpleea" "https://github.com/sergeyastanin/simpleea"` (even though that'd be perfectly valid), because it makes for ugly folders. But the prefix "https://github.com/" is exactly 19 characters, so `drop 19 "https://github.com/sergeyastanin/simpleea"` turns into "sergeyastanin/simpleea", giving us the right local directory name with no prefixed slash. Or you could just pass in the original "/username/reponame" and use `drop 1` on that instead. (Either way, you need to do additional work. Might as well just use `drop 19`.)
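If the magic number 19 seems fragile, a slightly more self-documenting variant (my own sketch, not part of the original script; the helper name `localDir` is made up) is to strip the known prefix explicitly:

~~~{.Haskell}
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- Hypothetical helper: turn "https://github.com/user/repo.git" into "user/repo.git",
-- falling back to the unchanged URL if the prefix is somehow absent.
localDir :: String -> FilePath
localDir u = fromMaybe u (stripPrefix "https://github.com/" u)
~~~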
One final note: many of the URLs end in ".git". If we disliked this, we could enhance the `drop 19` with [System.FilePath.dropExtension](!Hoogle): `dropExtension $ drop 19 u`.

## The script

The final program, clean of `-Wall` or [hlint](!Hackage) warnings:

~~~{.Haskell}
import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString, URLString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main :: IO ()
main = do html <- openURL "https://github.com/languages/Haskell/created"
          let lst = lastPage $ linkify html
          let indxPgs = take lst listPages
          repourls <- mapM getRepos indxPgs
          let gitURLs = map gitify $ sort $ concat repourls
          mapM_ shellToGit gitURLs

openURL :: URLString -> IO String
openURL target = fmap snd $ curlGetString target []

linkify :: String -> [String]
linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]

lastPage :: [String] -> Int
lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)

listPages :: [String]
listPages = map (\x -> "https://github.com/languages/Haskell/created?page=" ++ show x) [(1::Int)..]

repos :: String -> [String]
repos = uniq . linkify
  where uniq :: [String] -> [String]
        uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) exceptions)

        exceptions :: [String]
        exceptions = ["/languages/", "/login/", "/site/", "http://", "https://"]

        count :: String -> Bool
        count x = length (elemIndices '/' x) == 2

getRepos :: String -> IO [String]
getRepos = fmap repos . openURL

gitify :: String -> String
gitify x = "https://github.com" ++ x ++ ".git"

shellToGit :: String -> IO ()
shellToGit u = do (_,y,_) <- readProcessWithExitCode "git" ["clone", u, dropExtension $ drop 19 u] ""
                  print y
~~~

This, or a version of it, works well. But I caution people against misusing it! There are a *lot* of repositories on GitHub; please don't go running this carelessly. It will pull down 4-12 gigabytes of data. GitHub is a good FLOSS-friendly business by all accounts, and doesn't deserve people wasting its bandwidth & money if they are not even going to keep what they downloaded.

### The script golfed

For kicks, let's see what a shorter - more unmaintainable and unreadable - version looks like (in the best scripting-language tradition):

~~~{.Haskell}
import Data.List (elemIndices, isPrefixOf, sort)
import Network.Curl (curlGetString)
import System.FilePath (dropExtension)
import System.Process (readProcessWithExitCode)
import Text.HTML.TagSoup

main = do html <- openURL "https://github.com/languages/Haskell/created"
          let i = take (lastPage $ linkify html) $ map (("https://github.com/languages/Haskell/created?page="++) . show) [1..]
          repourls <- mapM (fmap (uniq . linkify) . openURL) i
          mapM_ shellToGit $ map (\x -> "https://github.com" ++ x ++ ".git") $ sort $ concat repourls
   where openURL target = fmap snd $ curlGetString target []
         linkify l = [x | TagOpen "a" atts <- parseTags l, (_,x) <- atts]
         lastPage = maximum . map (read . drop 32) . filter ("/languages/Haskell/created?page=" `isPrefixOf`)
         uniq = filter count . filter (\x -> not $ any (`isPrefixOf` x) ["/languages/", "/login/", "/site/", "http://", "https://"])
         count x = length (elemIndices '/' x) == 2
         shellToGit u = do { (_,y,_) <- readProcessWithExitCode "git" ["clone", u, dropExtension $ drop 19 u] ""; print y }
~~~

14 lines of code isn't too bad, especially considering that Haskell is not usually considered a language suited for scripting or scraping purposes like this.
Nor do I see any obvious missing abstractions - `count` is a function that might be useful in `Data.List`, and `openURL` is something that the Curl binding could provide on its own, but everything else looks pretty necessary.

# Exercises for the reader

1. Once one has all those repositories, how does one keep them up to date? The relevant command is `git pull`. How would one run it on all the repositories? In a shell script? Using `find`? From a crontab?^[Example cron answer: `@weekly find ~/bin -type d -name ".git" -execdir nice git pull \;`]
2. In the previous script, a number of short-cuts were taken which render it Haskell-specific. Identify and remedy them, turning this into a general-purpose script for downloading *any* language for which GitHub has a category. (The language name can be read as an argument to the script with standard functions like [getArgs](!Hoogle); a minimal sketch of that argument handling follows below.)
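For exercise 2, a minimal sketch of just the argument handling (my own illustration, under the assumption that the rest of the final script is reused unchanged; the `take 3`/`putStrLn` bits are placeholders so the fragment runs on its own):

~~~{.Haskell}
import Data.List (isPrefixOf)
import System.Environment (getArgs)

-- Generalized page counter from earlier: the prefix length replaces the hardwired `drop 32`.
lastPageGeneric :: String -> [String] -> Int
lastPageGeneric prefix = maximum . map (read . drop (length prefix)) . filter (prefix `isPrefixOf`)

main :: IO ()
main = do (lang:_) <- getArgs
          let prefix  = "/languages/" ++ lang ++ "/created?page="
          let indexes = map (\n -> "https://github.com" ++ prefix ++ show n) [(1::Int)..]
          -- ...download the first index, apply `lastPageGeneric prefix` to its links,
          -- `take` that many pages from `indexes`, and proceed as in the final script...
          mapM_ putStrLn (take 3 indexes) -- placeholder output so this sketch runs standalone
~~~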