The concept is simple: you want to make a web reference (an HTML link) to another web resource, but you don’t want that link to go bad. This requires a web link with just a little bit of smarts on the server. I call that a T-Link, and it is a key capability for a collaboration system.
The brilliance of the web is its simplicity. Back in the 1980s the hypertext crowd wasted person-years discussing how to maintain link consistency. Tim Berners-Lee ignored all that, and created the hyperlink as simply a reference to a server machine and an address on that machine. If things don’t move, this approach scales to incredible size (the entire Internet) and functions efficiently. However, if things do move, you get the infamous 404 error. Resources do move: DNS names go out of registration, and servers stop being maintained. The leftover links are ugly scars on the face of the web.
A rather drastic alternative is to copy all those resources to your own server, and make a local link to them. But a static copy is not nearly as useful as a live link to what might be a very dynamic page. The document being referenced might be updated, leaving the copy out of date; it is better to reference the latest version. A copy of a blog post will not have the more recent comments. The promise of Web 2.0 is to link into an ever-increasing set of relationships, and while copying a resource guarantees link integrity, it also cripples the ability to collaborate.
The idea is quite simply the merging of these two extremes. Whenever an author makes a link to an external document or page, the system also makes a local copy of that document or web page. The copy is just in case the resource disappears. When the containing page is served up, it contains either the link to the original resource, or if that is not available, it substitutes a reference to the local cached copy of the resource. In a way, the link is “patched-up” as necessary when remote resources disappear.
The author need only specify the URL of the remote resource. The system then automatically determines what to do from then on. The system will have to periodically test whether the remote resource exists. If the resource exists, the best response for the user is to link to it. If testing for the resource results in an error, then the link to the local copy is used. While the local copy is a poor substitute for the original, it is far better than a link to an error message.
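The periodic test can be as simple as an HTTP HEAD request. A minimal sketch in Python (the function name and timeout are my own; the post prescribes no particular implementation):

```python
import urllib.request
import urllib.error

def resource_is_live(url, timeout=10):
    """Return True if the remote resource still answers successfully."""
    try:
        # HEAD asks for headers only, so nothing large is downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError):
        # DNS failure, refused connection, 4xx/5xx, or a malformed URL:
        # all of these mean the link should fall back to the local copy.
        return False
```

Treating any network failure or error status as “missing” is deliberately conservative: the worst outcome is that a reader is briefly served the cached copy of a page that has actually come back up.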
It is called a “T-Link” because it looks similar to a T-shaped pipe. It is a link with two possible ways to be fulfilled: the request is the output of the T, and the content either comes “straight through” from the remote site, or is diverted to come from the local copy.
JSP, PHP, etc
In any site built on active pages, such as Java Server Pages (JSP) or PHP, most of the logic can be invoked at the time the page is served up. You need a small database containing (1) the original link URL, (2) the location of the local copy, (3) the status of the remote link, and (4) the date of the last check. Each embedded link is represented as a procedure call which determines which URL to generate, depending on whether the remote resource is available. If it has been a long time since the resource was tested, the call might test for its availability at that moment. Or, to avoid delay during page generation, a background task might check the availability of resources on a rotating basis.
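As a sketch of that database and the serve-time procedure call, here is what the four-column table and the link-generating function might look like, using SQLite from Python (all names are illustrative; an actual JSP or PHP site would express the same logic in its own language):

```python
import sqlite3
import time

RECHECK_INTERVAL = 7 * 24 * 3600  # re-test roughly once a week

def open_link_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS tlink (
        remote_url  TEXT PRIMARY KEY,  -- (1) original link URL
        local_copy  TEXT,              -- (2) location of the local copy
        is_live     INTEGER,           -- (3) status of the remote link
        checked_at  REAL               -- (4) date of last check
    )""")
    return db

def tlink_href(db, remote_url, check=None):
    """Return the URL to embed in the page: the remote one while it
    is live, otherwise the local cached copy."""
    row = db.execute(
        "SELECT local_copy, is_live, checked_at FROM tlink WHERE remote_url = ?",
        (remote_url,)).fetchone()
    if row is None:
        return remote_url  # unknown link: serve it as-is
    local_copy, is_live, checked_at = row
    # Optionally re-test a stale link at serve time; a link marked
    # missing is re-tested too, so a recovered site goes live again.
    if check is not None and time.time() - checked_at > RECHECK_INTERVAL:
        is_live = 1 if check(remote_url) else 0
        db.execute(
            "UPDATE tlink SET is_live = ?, checked_at = ? WHERE remote_url = ?",
            (is_live, time.time(), remote_url))
    return remote_url if is_live else local_copy
```

The `check` callback is whatever availability test the site uses; passing `None` skips the in-page re-test and leaves checking entirely to the background task.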
Links don’t rot quickly. Checking once a week or so is a reasonable rate. A link working two weeks ago is probably still working today. However, when a server disappears, or when a web site is relocated to a new address, broken links can remain in web pages for years. The idea is to patch those up with reasonable local copies to serve the remaining years of usefulness of that page.
Even a resource marked as missing should be re-checked for reappearance. A site might be down just at the moment that it was being checked. A later check might find it is available, and it should be returned to “live link” status.
Static Web Sites
A static site with HTML files can still make use of T-Links, as long as you have some sort of periodic process that runs over the files. The background process would run periodically, parsing the HTML files and pulling every link out. It would check the database, downloading a copy of a resource if there is no copy yet, and re-checking that the resource is still there if enough time has passed. If the resource has disappeared, it would modify the HTML file to contain the URL of the local copy. Each “href” contains either the remote address or the local address of a given resource. Running this background operation every couple of days might go a long way toward automatically patching all the links — even for static HTML pages.
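The href-rewriting step of that background process can be sketched in a few lines of Python (a deliberately simple regular-expression approach; a production version would use a real HTML parser):

```python
import re

def patch_hrefs(html, resolve):
    """Rewrite every href="..." in a static page, replacing each URL
    with whatever the resolver returns: the original URL while the
    remote resource is live, the local copy's path once it is gone."""
    def swap(match):
        quote, url = match.group(1), match.group(2)
        return 'href=%s%s%s' % (quote, resolve(url), quote)
    # Match href= followed by a quoted URL, either quote style.
    return re.sub(r'href=(["\'])(.*?)\1', swap, html)
```

The `resolve` function is the same remote-or-local decision used by an active-page site, so both styles of site can share one link database.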
Here we get to the interesting part. What does it mean to make a T-Link to unlicensed material? On the surface it appears that you are copying this content, and then republishing it. But is that really what you are doing?
First of all, we should be clear that this is not plagiarism. There is no mechanism to misrepresent the source or authorship of the original content. All the original credit and attribution remains in place. Ideally, the served copy is identical to the original.
Second, if the original content is available, you are simply linking to that content. There is no intent to republish anything that is still available. Any originator who is concerned about advertising revenue, and who takes care to provide a permanent location for content, never encounters any issue. It becomes an issue only when the original content disappears from its location.
It could be argued that all you are doing is “time-shifting”. The legal precedent behind video tape machines (thank you, Sony) is that you are allowed to record something and play it back later. Isn’t that what a T-Link is? It is just a recording of what was on the web, made available for playback later, in the same spirit as the Wayback Machine and other web archiving sites. There is a question of whether this counts as “personal use”, but the spirit is clearly one of making something that was freely available at one time available at another time, when for some technical reason it cannot be accessed from the original source.
There is a deeper issue, though, around free speech and the right to keep records. There are some differences between the way print media and online media work that need to be carefully considered. If you print a book and sell me a copy, you have no legal right to ask me to return it. I can keep my copy forever, and refer to it as often as I want. While I may not be allowed to publish the contents as a new book, you do not have a right to deny me access to the book I purchased, and you do not have a right to keep me from lending it to my friends to read. There is no government-guaranteed “right of revocation” to remove content from someone who has received it.
On the web, you copy a document just to read it. Thus the digital download is analogous to reading a book, not publishing one. (Of course, book and other content publishers don’t see it this way … yet.) There is a huge unresolved area around what it means to block access to something that was formerly available. The more we see the web as “speech”, the more the web blurs the distinction between tangible and intangible content, while copyright covers only tangible media.
Yet all this may be immaterial since the major source of link rot is due to organizations that disappear. Examples: a forum with questions and answers which no longer has a supporting organization; a blogger who has lost interest and stopped paying for the domain name; an organization that set up a wiki, and then stopped paying for the hosting of it. In these cases there is nobody concerned about a copy. On the contrary, the copy is a service that would be welcomed by the owner.
Automating Link Repair
If an organization does care, the DMCA requires it to contact the site owner and ask that the duplicated content be removed. An automated mechanism could handle such requests.
This is actually an opportunity to repair the link. An organization still in the business of providing the content, now at a new location, might provide the address of that new location, so that all the links can be updated. This is the patch-up capability that the original hypertext crowd wanted. It is ironic that without the copy of the content, the broken link would be impossible to interpret; it is actually the copy, which can still be found, that enables the link to be fixed.
In the admittedly rare case that an organization has moved the content and does not want the links updated to the new place, an option could be provided to remove all links to that content. The reason I consider this rare is that there is a business case for wanting links to keep coming to an organization’s site. There is also the case where the organization disappears and does not care. It will be rare for an organization to be motivated to go around removing links to its site, but however rare, such requests can easily be accommodated.
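The two owner requests described above, repairing a moved link versus removing it entirely, amount to a very small piece of logic. A sketch, using a plain dictionary in place of the link database (names are illustrative):

```python
def handle_owner_request(links, old_url, new_url=None):
    """Apply a content owner's request to a link table, here a plain
    dict of remote_url -> local_copy. Supplying a new URL repairs the
    link; supplying none removes the cached copy and the record."""
    if old_url not in links:
        return
    if new_url:
        # Link repaired: the record now points at the content's new home.
        links[new_url] = links.pop(old_url)
    else:
        # Owner wants no link at all: drop the copy and the record.
        del links[old_url]
```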
The net result: no more link rot.