Here at H.M. Pharma Consultancy, few things make us sigh more than the notorious “404 – Page not found” error: the server is responding, but the link that was supposed to lead to a specific page leads nowhere. “301 – Moved Permanently” or “410 – Gone” are slightly more informative but rarely used (and the frustration is the same anyway). And then there are the all-too-frequent DNS errors, where the link points to a server name or web domain that no longer exists. This becomes increasingly troublesome when research has to reach further back in time, e.g. when we are researching companies that have been taken over.
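For the technically inclined, here is a minimal sketch of how these failure modes surface programmatically, using only Python's standard library (the URL is a placeholder): HTTP errors such as 404 or 410 come back from a live server, while a dead domain fails at the DNS level before any page can answer.

```python
# Minimal link-rot checker. Distinguishes pages that are gone (HTTP 404/410,
# reported by a still-living server) from domains that no longer exist
# (DNS/connection failure). The URL below is a placeholder.
import urllib.error
import urllib.request

def check_link(url):
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            # Note: 301 redirects are followed silently, so a "live" result
            # may already point somewhere other than the original page.
            return f"{response.getcode()} - link is alive"
    except urllib.error.HTTPError as e:
        # The 404s and 410s lamented above: server answers, page is gone.
        return f"{e.code} - page is gone, though the server still answers"
    except urllib.error.URLError as e:
        # Typically a DNS failure: the domain itself no longer exists.
        return f"unreachable - {e.reason}"

print(check_link("http://www.example.com/"))
```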
What is a nuisance during on-the-spot web navigation becomes a massive problem in publishing when links that are supposed to provide pivotal online references go dead over the years. This applies to scientific papers, corporate reports and legal expert opinions in equal measure.
The degree to which everybody seems to accept eventual link rot as a fact of digital life has always amazed me. There are broad, global, and relatively well-funded (public, private, and public-private) projects underway that address digital preservation, but they focus on preserving, and keeping accessible, what is provided as files. The fluid world of websites is mostly left to fragmented, automated crawler-based archiving services such as the Internet Archive (annual budget: $10 million) with its Wayback Machine, and to the search engines’ caches, which you typically reach by prefixing the URL with “cache:” or clicking a “cached” link. None of these is consistently useful for researchers.
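To see just how hit-or-miss crawler-based archiving is, you can ask the Wayback Machine whether it happens to hold a snapshot of a page at all. A sketch against its public availability API (endpoint and JSON shape as documented by archive.org; the URL and timestamp are placeholders):

```python
# Query the Wayback Machine's public availability API for the snapshot
# closest to a given date. Endpoint and response shape per archive.org's
# documentation; the URL and timestamp below are placeholders.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp="20130101"):
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=10) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    # 'closest' is simply absent if the crawler never visited the page --
    # exactly the fragility complained about above.
    if closest and closest.get("available"):
        return closest["url"]
    return None

print(closest_snapshot("http://www.example.com/"))
```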
Sure, as far as mainstream peer-reviewed papers are concerned, the National Library of Medicine’s PubMed Identifiers (PMIDs) provide links to journal abstracts that can be assumed to be reasonably permanent, and Digital Object Identifiers (DOIs) are designed for archiving and referencing. But to be eligible for either, a document has to be stable. Information that is published ad hoc will never get either identifier, or see the light of CrossRef. Breaking news in posters or oral-presentation abstracts on conference websites, and candidate drug development information on corporate websites, belong in this category. If you provide links to such web documents as references in your paper or report, it is a pretty safe bet that they will be broken after a few years. Even worse, the target web location might still exist while its content has changed – after all, you can never control what webmasters will do. If you want to use online citations as references, what you need is a permanently archived snapshot with an equally permanent and unique link.
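Why do PMIDs and DOIs hold up so much better than raw URLs? Because they are resolved through a central resolver rather than pointing at a physical location, so the publisher can move the document without breaking the citation. A sketch of the idea (the DOI and PMID below are illustrative placeholders; doi.org is the real DOI resolver, though some publishers block script user agents, so treat this as a sketch):

```python
# Stable identifiers decouple a citation from a document's physical
# location: the resolver maps the ID to wherever the publisher currently
# hosts it. The DOI and PMID below are illustrative placeholders.
import urllib.request

def resolve_doi(doi):
    """Follow the doi.org resolver and return the current landing URL."""
    with urllib.request.urlopen(f"https://doi.org/{doi}", timeout=10) as resp:
        return resp.url  # final URL after the resolver's redirect

def pubmed_url(pmid):
    """PMIDs are stable because the URL pattern itself is the contract."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"

print(resolve_doi("10.1000/182"))  # 10.1000/182: the DOI Handbook's own DOI
print(pubmed_url("23193287"))     # placeholder PMID
```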
There are only two web-based services of any note that let any user initiate a snapshot of any given website on demand (unless the site opts out of archiving, e.g. via a NOARCHIVE robots meta tag) without charge, AND provide a permanent public URL to that snapshot, AND are committed to keeping the snapshot available under this link indefinitely. Only one of these, WebCite, is directly tailored to the requirements of scientific researchers. It appears to be in perpetual financial trouble; there have been downtimes, and since January 2013 its website has carried a big red header announcing that WebCite will cease accepting new archiving requests by the end of the year unless its FundRazr crowdfunding campaign succeeds. The funding target: $25,000. Donations so far (as of March 11, 2013): $3,246. The other service, Archive.is, has a much more general scope but has the advantage of being better tuned to Web 2.0 content. It also stores a screenshot of the page alongside the archived text. Few seem to know about it, or care.
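For the curious: on-demand archiving is usually nothing more than an HTTP request. A hypothetical sketch against Archive.is follows; the /submit/ endpoint and its “url” form field mirror the service’s own web form as of this writing, but both are assumptions you should verify before relying on them (the target URL is a placeholder):

```python
# Submit a page to Archive.is for on-demand snapshotting. The endpoint
# (/submit/) and form field ("url") are assumptions based on the service's
# own web form -- verify against the current site before relying on them.
# The target URL is a placeholder.
import urllib.parse
import urllib.request

def archive_on_demand(target_url):
    data = urllib.parse.urlencode({"url": target_url}).encode("utf-8")
    request = urllib.request.Request("https://archive.is/submit/", data=data)
    with urllib.request.urlopen(request, timeout=60) as resp:
        # On success the service redirects to the permanent snapshot URL --
        # the stable link you would then cite in a paper or report.
        return resp.url

print(archive_on_demand("http://www.example.com/"))
```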
This situation amounts to a scandal of knowledge management. Providing permanent public links to permanent archival snapshots has to be a centralized, reliable service with staying power. If the Wikimedia Foundation (which explicitly encourages using WebCite to guard against link rot; see here) had total cash expenditures of $27 million in 2011-12 and can budget revenues of $46.1 million for 2012-13, would it not be reasonable to channel a tiny fraction of that towards WebCite, to put it back on its feet and guarantee its future? Or even to take it under Wikipedia’s wing? Or – dear me, what a fantasy! – what if UNESCO provided such a service, in keeping with its lofty mission? The required funds would probably disappear among the rounding errors.