Skip to main content

archiving directory


“Internet Search Tips”, Branwen 2018

Search: “Internet Search Tips”⁠, Gwern Branwen (2018-12-11; ⁠, ⁠, ⁠, ; backlinks; similar):

A description of advanced tips and tricks for effective Internet research of papers/​books, with real-world examples.

Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books. I try to write down my search workflow, and give general advice about finding and hosting documents, with demonstration case studies⁠.

“SMPY Bibliography”, Branwen 2018

SMPY: “SMPY Bibliography”⁠, Gwern Branwen (2018-07-28; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

An annotated fulltext bibliography of publications on the Study of Mathematically Precocious Youth (SMPY), a longitudinal study of high-IQ youth.

SMPY (Study of Mathematically Precocious Youth) is a long-running longitudinal survey of extremely mathematically-talented or intelligent youth, which has been following high-IQ cohorts since the 1970s. It has provided the largest and most concrete findings about the correlates and predictive power of screening extremely intelligent children, and revolutionized gifted & talented educational practices.

Because it has been running for over 40 years, SMPY-related publications are difficult to find; many early papers were published only in long-out-of-print books and are not available in any other way. Others are digitized and more accessible, but one must already know they exist. Between these barriers, SMPY information is less widely available & used than it should be given its importance.

To fix this, I have been gradually going through all SMPY citations and making fulltext copies available online with occasional commentary.

“Easy Cryptographic Timestamping of Files”, Branwen 2015

Timestamping: “Easy Cryptographic Timestamping of Files”⁠, Gwern Branwen (2015-12-04; ⁠, ⁠, ⁠, ; backlinks; similar):

Scripts for convenient free secure Bitcoin-based dating of large numbers of files/​strings

Local archives are useful for personal purposes, but sometimes, in investigations that may be controversial, you want to be able to prove that the copy you downloaded was not modified and you need to timestamp it and prove the exact file existed on or before a certain date. This can be done by creating a cryptographic hash of the file and then publishing that hash to global chains like centralized digital timestampers or the decentralized Bitcoin blockchain. Current timestamping mechanisms tend to be centralized, manual, cumbersome, or cost too much to use routinely. Centralization can be overcome by timestamping to Bitcoin; costing too much can be overcome by batching up an arbitrary number of hashes and creating just 1 hash/​timestamp covering them all; manual & cumbersome can be overcome by writing programs to handle all of this and incorporating them into one’s workflow. So using an efficient cryptographic timestamping service (the OriginStamp Internet service), we can write programs to automatically & easily timestamp arbitrary files & strings, timestamp every commit to a Git repository, and webpages downloaded for archival purposes. We can implement the same idea offline, without reliance on OriginStamp, but at the cost of additional software dependencies like a Bitcoin client.

“Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot”, Klein et al 2014

“Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot”⁠, Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, Lyudmila Balakireva, Ke Zhou et al (2014-12-26; ; backlinks; similar):

The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. But, the transition from a paper-based to a web-based scholarly communication system also poses challenges. In this paper, we focus on reference rot, the combination of link rot and content drift to which references to web resources included in Science, Technology, and Medicine (STM) articles are subject. We investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. We do so on the basis of a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we determine whether the HTTP URI is still responsive on the live web and whether web archives contain an archived snapshot representative of the state the referenced resource had at the time it was referenced. We observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten. We suggest that, in order to safeguard the long-term integrity of the web-based scholarly record, robust solutions to combat the reference rot problem are required. In conclusion, we provide a brief insight into the directions that are explored with this regard in the context of the Hiberlink project.

“The Sort --key Trick”, Branwen 2014

Sort: “The sort --key Trick”⁠, Gwern Branwen (2014-03-03; ⁠, ⁠, ⁠, ; backlinks; similar):

Commandline folklore: sorting files by filename or content before compression can save large amounts of space by exposing redundancy to the compressor. Examples and comparisons of different sorts.

Programming folklore notes that one way to get better lossless compression efficiency is by the precompression trick of rearranging files inside the archive to group ‘similar’ files together and expose redundancy to the compressor, in accordance with information-theoretical principles. A particularly easy and broadly-applicable way of doing this, which does not require using any unusual formats or tools and is fully compatible with the default archive methods, is to sort the files by filename and especially file extension.

I show how to do this with the standard Unix command-line sort tool, using the so-called “sort --key trick”, and give examples of the large space-savings possible from my archiving work for personal website mirrors and for making darknet market mirror datasets where the redundancy at the file level is particularly extreme and the sort --key trick shines compared to the naive approach.

“Darknet Market Archives (2013–2015)”, Branwen 2013

DNM-archives: “Darknet Market Archives (2013–2015)”⁠, Gwern Branwen (2013-12-01; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Mirrors of ~89 Tor-Bitcoin darknet markets & forums 2011–2015, and related material.

Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services providing escrow services between buyers & sellers transacting in Bitcoin or other cryptocoins, usually for drugs or other illegal/​regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model in 2011.

From 2013–2015, I scraped/​mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage⁠, lifetimes /  ​ characteristics⁠, & legal riskiness⁠; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could.

This uniquely comprehensive collection is now publicly released as a 50GB (~1.6TB uncompressed) collection covering 89 DNMs & 37+ related forums, representing <4,438 mirrors, and is available for any research.

This page documents the download, contents, interpretation, and technical methods behind the scrapes.

“Predicting Google Closures”, Branwen 2013

Google-shutdowns: “Predicting Google closures”⁠, Gwern Branwen (2013-03-28; ⁠, ⁠, ⁠, ; backlinks; similar):

Analyzing predictors of Google abandoning products; predicting future shutdowns

Prompted by the shutdown of Google Reader⁠, I ponder the evanescence of online services and wonder what is the risk of them disappearing. I collect data on 350 Google products launched before March 2013, looking for variables predictive of mortality (web hits, service vs software, commercial vs free, FLOSS, social networking, and internal vs acquired). Shutdowns are unevenly distributed over the calendar year or Google’s history. I use logistic regression & survival analysis (which can deal with right-censorship) to model the risk of shutdown over time and examine correlates. The logistic regression indicates socialness, acquisitions, and lack of web hits predict being shut down, but the results may not be right. The survival analysis finds a median lifespan of 2824 days with a roughly Type III survival curve (high early-life mortality); a Cox regression finds similar results as the logistic - socialness, free, acquisition, and long life predict lower mortality. Using the best model, I make predictions about probability of shutdown of the most risky and least risky services in the next 5 years (up to March 2018). (All data & R source code is provided.)

“How Much of the Web Is Archived?”, Ainsworth et al 2012

“How Much of the Web Is Archived?”⁠, Scott G. Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson (2012-12-26; backlinks; similar):

Although the Internet Archive’s Wayback Machine is the largest and most well-known web archive, there have been a number of public web archives that have emerged in the last several years. With varying resources, audiences and collection development policies, these archives have varying levels of overlap with each other. While individual archives can be measured in terms of number of URIs, number of copies per URI, and intersection with other archives, to date there has been no answer to the question “How much of the Web is archived?” We study the question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes; and, counting the number of copies of the sample URIs exist in various public web archives. Each sample set provides its own bias. The results from our sample sets indicate that range from 35%–90% of the Web has at least one archived copy, 17%–49% has between 2–5 copies, 1%–8% has 6–10 copies, and 8%–63% has more than 10 copies in public web archives. The number of URI copies varies as a function of time, but no more than 31.3% of URIs are archived more than once per month.

“Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?”, SalahEldeen & Nelson 2012

“Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?”⁠, Hany M. SalahEldeen, Michael L. Nelson (2012-09-13; ; backlinks; similar):

Social media content has grown exponentially in the recent years and the role of social media has evolved from just narrating life events to actually shaping them. In this paper we explore how many resources shared in social media are still available on the live web or in public web archives. By analyzing six different event-centric datasets of resources shared in social media in the period from June 2009 to March 2012, we found about 11% lost and 20% archived after just a year and an average of 27% lost and 41% archived after two and a half years. Furthermore, we found a nearly linear relationship between time of sharing of the resource and the percentage lost, with a slightly less linear relationship between time of sharing and archiving coverage of the resource. From this model we conclude that after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.

“Archiving GitHub”, Branwen 2011

Archiving-GitHub: “Archiving GitHub”⁠, Gwern Branwen (2011-03-20; ⁠, ; similar):

Scraping and downloading Haskell-related repositories from GitHub

Tutorial of how to write a Haskell program to scrape Haskell-related repositories on GitHub and download them for offline installation, search, reference, and source code analysis, using TagSoup & curl⁠.


“Archiving URLs”, Branwen 2011

Archiving-URLs: “Archiving URLs”⁠, Gwern Branwen (2011-03-10; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.

Links on the Internet last forever or a year, whichever comes first. This is a major problem for anyone serious about writing with good references, as link rot will cripple several% of all links each year, and compounding.

To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.

As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).

“Design Graveyard”, Branwen 2010

Design-graveyard: “Design Graveyard”⁠, Gwern Branwen (2010-10-01; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Meta page describing website design experiments and post-mortem analyses.

Often the most interesting part of any design are the parts that are invisible—what was tried but did not work. Sometimes they were unnecessary, other times users didn’t understand them because it was too idiosyncratic, and sometimes we just can’t have nice things.

Some post-mortems of things I tried on but abandoned (in chronological order).

“Design Of This Website”, Branwen 2010

Design: “Design Of This Website”⁠, Gwern Branwen (2010-10-01; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Meta page describing site implementation and experiments for better ‘structural reading’ of hypertext; technical decisions using Markdown and static hosting. is implemented as a static website compiled via Hakyll from Pandoc Markdown and hosted on a dedicated server (due to expensive cloud bandwidth).

It stands out from your standard Markdown static website by aiming at good typography, fast performance, and advanced hypertext browsing features (at the cost of great implementation complexity); the 4 design principles are: aesthetically-pleasing minimalism, accessibility/​progressive-enhancement, speed, and a ‘structural reading’ approach to hypertext use.

Unusual features include the monochrome esthetics, sidenotes instead of footnotes on wide windows, efficient drop caps/​smallcaps, collapsible sections, automatic inflation-adjusted currency, Wikipedia-style link icons & infoboxes, custom syntax highlighting⁠, extensive local archives to fight linkrot, and an ecosystem of “popup”/​“popin” annotations & previews of links for frictionless browsing—the net effect of hierarchical structures with collapsing and instant popup access to excerpts enables iceberg-like pages where most information is hidden but the reader can easily drill down as deep as they wish. (For a demo of all features & stress-test page, see Lorem Ipsum⁠.)

Also discussed are the many failed experiments /  ​ changes made along the way.

“About This Website”, Branwen 2010

About: “About This Website”⁠, Gwern Branwen (2010-10-01; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Meta page describing site ideals of stable long-term essays which improve over time; idea sources and writing methodology; metadata definitions; site statistics; copyright license.

This page is about content; for the details of its implementation & design like the popup paradigm, see Design⁠; and for information about me, see Links⁠.

“Writing a Wikipedia RSS Link Archive Bot”, Branwen 2009

Wikipedia-RSS-Archive-Bot: “Writing a Wikipedia RSS Link Archive Bot”⁠, Gwern Branwen (2009-11-02; ⁠, ⁠, ; backlinks; similar):

Archiving using Wikipedia Recent Changes RSS feed (obsolete).

Continuation of the 2009 Haskell Wikipedia link archiving bot tutorial, extending it from operating on a pre-specified list of articles to instead archiving links live by using TagSoup parsing Wikipedia Recent Changes for newly-added external links which can be archived using WebCite in parallel. (Note: these tutorials are obsolete. WebCite is largely defunct, doing archiving this way is not advised, and WP link archiving is currently handled by Internet Archive-specific plugins by the WMF. For a more general approach suitable for personal use, see the writeup of archiver-bot in Archiving URLs⁠.)

“Writing a Wikipedia Link Archive Bot”, Branwen 2008

Wikipedia-Archive-Bot: “Writing a Wikipedia Link Archive Bot”⁠, Gwern Branwen (2008-09-26; ⁠, ⁠, ; backlinks; similar):

Haskell: tutorial on writing a daemon to archive links in Wikipedia articles with TagSoup and WebCite; obsolete.

This is a 2008 tutorial demonstrating how to write a Haskell program to automatically archive Internet links into WebCite & Internet Archive to avoid linkrot, by parsing WP dumps, downloading & parsing WP articles for external links with the TagSoup HTML parsing library, using the WebCite/​IA APIs to archive them, and optimizing runtime. This approach is suitable for one-off crawls but not for live archiving using the RSS feed; for the next step, see Wikipedia RSS Archive Bot for a demonstration of how one could write a RSS-oriented daemon.


“The Prevalence and Inaccessibility of Internet References in the Biomedical Literature at the Time of Publication”, Aronsky et al 2007

“The Prevalence and Inaccessibility of Internet References in the Biomedical Literature at the Time of Publication”⁠, Dominik Aronsky, Sina Madani, Randy J. Carnevale, Stephany Duda, Michael T. Feyder (2007-03; backlinks; similar):

Objectives: To determine the prevalence and inaccessibility of Internet references in the bibliography of biomedical publications when first released in PubMed®.

Methods: During an one-month observational study period (Feb 21 to Mar 21, 2006) the Internet citations from a 20% random sample of all forthcoming publications released in PubMed during the previous day were identified. Attempts to access the referenced Internet citations were completed within one day and inaccessible Internet citations were recorded.

Results: The study included 4,699 publications from 844 different journals. Among the 141,845 references there were 840 (0.6%) Internet citations. One or more Internet references were cited in 403 (8.6%) articles. From the 840 Internet references, 11.9% were already inaccessible within two days after an article’s release to the public.

Conclusion: The prevalence of Internet citations in journals included in PubMed is small (<1%); however, the inaccessibility rate at the time of publication is considered substantial. Authors, editors, and publishers need to take responsibility for providing accurate and accessible Internet references.

Unforgotten Dreams: Poems by the Zen Monk Shōtetsu”, Shōtetsu & Carter 1997

1997-carter-shotetsu-unforgottendreams.pdf: Unforgotten Dreams: Poems by the Zen Monk Shōtetsu⁠, Shōtetsu, Steven D. Carter (1997; ; backlinks; similar):

[This volume presents translations of over 200 poems by Shōtetsu⁠, who is generally considered to be the last great poet of the uta form. Includes an introduction, a glossary of important names and places and a list of sources of the poems.]

The Zen monk Shōtetsu (1381–1459) suffered several rather serious misfortunes in his life: he lost all the poems of his first thirty years—more than 30,000 of them—in a fire; his estate revenues were confiscated by an angry shogun; and rivals refused to allow his work to appear in the only imperially commissioned poetry anthology of his time. Undeterred by these obstacles, he still managed to make a living from his poetry and won recognition as a true master, widely considered to be the last great poet of the classical uta, or waka, tradition. Shōtetsu viewed his poetry as both a professional and religious calling, and his extraordinarily prolific corpus comprised more than 11,000 poems—the single largest body of work in the Japanese canon.

The first major collection of Shōtetsu’s work in English, Unforgotten Dreams presents beautifully rendered translations of more than two hundred poems. The book opens with Steven Carter’s generous introduction on Shōtetsu’s life and work and his importance in Japanese literature, and includes a glossary of important names and places and a list of sources of the poems. Revealing as never before the enduring creative spirit of one of Japan’s greatest poets, this fine collection fills a major gap in the English translations of medieval Japanese literature.