We explore the availability and persistence of URLs cited in articles published in D-Lib Magazine. We extracted 4387 unique URLs referenced in 453 articles published from July 1995 to August 2004. The availability was checked three times a week for 25 weeks from September 2004 to February 2005. We found that approximately 28% of those URLs failed to resolve initially, and 30% failed to resolve at the last check. A majority of the unresolved URLs were due to 404 (page not found) and 500 (internal server error) errors. The content pointed to by the URLs was relatively stable; only 16% of the content registered more than a 1 KB change during the testing period. We explore possible factors which may cause a URL to fail by examining its age, path depth, top-level domain and file extension. Based on the data collected, we found the half-life of a URL referenced in a D-Lib Magazine article is approximately 10 years. We also found that URLs were more likely to be unavailable if they pointed to resources in the .net, .edu or country-specific top-level domain, used non-standard ports (i.e., not port 80), or pointed to resources with uncommon or deprecated extensions (e.g., .shtml, .ps, .txt).
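The URL properties named in the abstract above (top-level domain, path depth, file extension, port) can be read directly off a URL string. The following is a minimal sketch using Python's standard `urllib.parse`, not the authors' tooling; the function name `url_features`, the returned field names, and the example URLs are hypothetical, and "non-standard port" follows the abstract's definition of anything other than port 80.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: when a URL gives no explicit port, use the scheme's default.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_features(url: str) -> dict:
    """Extract URL properties the abstract lists as possible decay factors."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Top-level domain: the label after the last dot of the host name.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    # Path depth: number of non-empty path segments.
    segments = [s for s in parts.path.split("/") if s]
    # Extension of the last path segment, if any (e.g. ".shtml").
    extension = PurePosixPath(segments[-1]).suffix.lower() if segments else ""
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "tld": tld,
        "path_depth": len(segments),
        "extension": extension,
        "port": port,
        # The abstract defines non-standard as "not port 80".
        "nonstandard_port": port != 80,
    }


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for example in ("http://www.example.edu:8080/a/b/paper.ps", "http://example.org/"):
        print(example, url_features(example))
```

Features like these could then be tabulated against whether each URL resolved, which is the kind of comparison the abstract describes; the age of a URL is not derivable from the string and would have to come from the citing article's publication date.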
The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. But the transition from a paper-based to a web-based scholarly communication system also poses challenges. In this paper, we focus on reference rot, the combination of link rot and content drift to which references to web resources included in Science, Technology, and Medicine (STM) articles are subject. We investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. We do so on the basis of a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we determine whether the HTTP URI is still responsive on the live web and whether web archives contain an archived snapshot representative of the state the referenced resource had at the time it was referenced. We observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten. We suggest that, in order to safeguard the long-term integrity of the web-based scholarly record, robust solutions to combat the reference rot problem are required. In conclusion, we provide a brief insight into the directions that are explored in this regard in the context of the Hiberlink project.
2006-wren.pdf: “Uniform Resource Locator Decay in Dermatology Journals: Author Attitudes and Preservation Practices”, (2006-09):
Objectives: To describe dermatology journal uniform resource locator (URL) use and persistence and to better understand the level of control and awareness of authors regarding the availability of the URLs they cite.
Design: Software was written to automatically access URLs in articles published between January 1, 1999, and September 30, 2004, in the 3 dermatology journals with the highest scientific impact. Authors of publications with unavailable URLs were surveyed regarding URL content, availability, and preservation.
Main Outcome Measures: Uniform resource locator use and persistence and author opinions and practices.
Results: The percentage of articles containing at least 1 URL increased from 2.3% in 1999 to 13.5% in 2004. Of the 1113 URLs, 81.7% were available (decreasing with time since publication from 89.1% of 2004 URLs to 65.4% of 1999 URLs) (p < 0.001). Uniform resource locator unavailability was highest in The Journal of Investigative Dermatology (22.1%) and lowest in the Archives of Dermatology (14.8%) (p = 0.03). Some content was partially recoverable via the Internet Archive for 120 of the 204 unavailable URLs. Most authors (55.2%) agreed that the unavailable URL content was important to the publication, but few controlled URL availability personally (5%) or with the help of others (employees, colleagues, and friends) (6.7%).
Conclusions: Uniform resource locators are increasingly used and lost in dermatology journals. Loss will continue until better preservation policies are adopted.
Objectives: To determine the prevalence and inaccessibility of Internet references in the bibliography of biomedical publications when first released in PubMed®.
Methods: During a one-month observational study period (Feb 21 to Mar 21, 2006) the Internet citations from a 20% random sample of all forthcoming publications released in PubMed during the previous day were identified. Attempts to access the referenced Internet citations were completed within one day and inaccessible Internet citations were recorded.
Results: The study included 4,699 publications from 844 different journals. Among the 141,845 references there were 840 (0.6%) Internet citations. One or more Internet references were cited in 403 (8.6%) articles. From the 840 Internet references, 11.9% were already inaccessible within two days after an article’s release to the public.
Conclusion: The prevalence of Internet citations in journals included in PubMed is small (<1%); however, the inaccessibility rate at the time of publication is considered substantial. Authors, editors, and publishers need to take responsibility for providing accurate and accessible Internet references.
Social media content has grown exponentially in recent years and the role of social media has evolved from just narrating life events to actually shaping them. In this paper we explore how many resources shared in social media are still available on the live web or in public web archives. By analyzing six different event-centric datasets of resources shared in social media in the period from June 2009 to March 2012, we found about 11% lost and 20% archived after just a year and an average of 27% lost and 41% archived after two and a half years. Furthermore, we found a nearly linear relationship between time of sharing of the resource and the percentage lost, with a slightly less linear relationship between time of sharing and archiving coverage of the resource. From this model we conclude that after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.
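The closing sentence of the abstract above amounts to a simple linear rule, which can be written out as a short calculation. This is a sketch of a literal reading of the two stated rates, not the authors' fitted model; the function name `estimated_loss_percent`, the 365-day year, and the cap at 100% are assumptions made for illustration.

```python
FIRST_YEAR_LOSS_PCT = 11.0  # "nearly 11%" lost after the first year (from the abstract)
DAILY_LOSS_PCT = 0.02       # additional loss per day after that (from the abstract)
DAYS_PER_YEAR = 365         # assumption: a year is counted as 365 days


def estimated_loss_percent(days_since_sharing: int) -> float:
    """Estimate the percentage of shared resources lost after a number of days.

    Only defined from the end of the first year onward, because the abstract
    gives no rate for the first year itself.
    """
    if days_since_sharing < DAYS_PER_YEAR:
        raise ValueError("the stated rule only covers one year or more after sharing")
    extra_days = days_since_sharing - DAYS_PER_YEAR
    # Assumption: cap at 100%, since the linear rule cannot hold indefinitely.
    return min(100.0, FIRST_YEAR_LOSS_PCT + DAILY_LOSS_PCT * extra_days)


if __name__ == "__main__":
    for days in (365, 730, 913):
        print(days, "days:", round(estimated_loss_percent(days), 2), "% lost")
```

Read this way, the rule gives roughly 22% lost at two and a half years, somewhat below the 27% average the abstract reports, so the rounded rates are best treated as approximate.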
1997-carter-shotetsu-unforgottendreams.pdf: “Unforgotten Dreams: Poems by the Zen Monk Shōtetsu”, (1997):
[This volume presents translations of over 200 poems by Shōtetsu, who is generally considered to be the last great poet of the uta form. Includes an introduction, a glossary of important names and places and a list of sources of the poems.]
The Zen monk Shōtetsu (1381–1459) suffered several rather serious misfortunes in his life: he lost all the poems of his first thirty years (more than 30,000 of them) in a fire; his estate revenues were confiscated by an angry shogun; and rivals refused to allow his work to appear in the only imperially commissioned poetry anthology of his time. Undeterred by these obstacles, he still managed to make a living from his poetry and won recognition as a true master, widely considered to be the last great poet of the classical uta, or waka, tradition. Shōtetsu viewed his poetry as both a professional and religious calling, and his extraordinarily prolific corpus comprised more than 11,000 poems, the single largest body of work in the Japanese canon.
The first major collection of Shōtetsu’s work in English, Unforgotten Dreams presents beautifully rendered translations of more than two hundred poems. The book opens with Steven Carter’s generous introduction on Shōtetsu’s life and work and his importance in Japanese literature, and includes a glossary of important names and places and a list of sources of the poems. Revealing as never before the enduring creative spirit of one of Japan’s greatest poets, this fine collection fills a major gap in the English translations of medieval Japanese literature.
“Where Did the Web Archive Go?”, (2021-08-12):
To perform a longitudinal investigation of web archives and detect variations and changes in replaying individual archived pages, or mementos, we created a sample of 16,627 mementos from 17 public web archives. Over the course of our 14-month study (November 2017 to January 2019), we found that four web archives changed their base URIs and did not leave a machine-readable method of locating their new base URIs, necessitating manual rediscovery. Of the 1,981 mementos in our sample from these four web archives, 537 were impacted: 517 mementos were rediscovered but with changes in their time of archiving (or Memento-Datetime), HTTP status code, or the string comprising their original URI (or URI-R), and 20 of the mementos could not be found at all.
“How Much of the Web Is Archived?”, (2012-12-26):
Although the Internet Archive’s Wayback Machine is the largest and most well-known web archive, a number of public web archives have emerged in the last several years. With varying resources, audiences and collection development policies, these archives have varying levels of overlap with each other. While individual archives can be measured in terms of number of URIs, number of copies per URI, and intersection with other archives, to date there has been no answer to the question “How much of the Web is archived?” We study the question by approximating the Web using sample URIs from DMOZ, Delicious, Bitly, and search engine indexes, and counting the number of copies of the sample URIs that exist in various public web archives. Each sample set carries its own bias. The results from our sample sets indicate that 35%-90% of the Web has at least one archived copy, 17%-49% has 2–5 copies, 1%-8% has 6–10 copies, and 8%-63% has more than 10 copies in public web archives. The number of URI copies varies as a function of time, but no more than 31.3% of URIs are archived more than once per month.