Update 8/20: tr.im has since threatened to shut down (then not), and bit.ly has proposed the FDIC of shorteners. I look back on my post below and concede that the hashing thing will never happen, but the central backup surely should occur, and 301works.org is the best chance it will come about. Any service that claims that protecting your “privacy” prevents it from backing up its link archive with a third party is lying or misunderstands the purpose of a central repository, and shouldn’t be trusted with your links anyway.
The latest brouhaha over URL shorteners is overblown. Are they evil or unnecessary? Who cares? They’re not going away. Proclaiming the death of “unnecessary” institutions is a tired cliché. I’m all for self-documenting pretty URLs as much as the next guy, but come on.
I feel that the core argument boils down to discomfort with the opaqueness of the whole thing. Where does a shortened link go? You don’t know until you’re there. What kind of permanence does the shortening service itself have? This kind of worry is an echo of the greater distrustful zeitgeist, though: if AIG and GM can teeter on the edge, what hope can we have for TinyURL? That’s probably how Philistine foot soldiers felt when Goliath fell.
It appears that many shorteners use an auto-incrementing database key to store link references. Thus the first link is http://example.com/1, the second is http://example.com/2, and by using the digits plus all the letters of the alphabet (base 36), link number 999,999,999 is only http://example.com/GJDGXR, six extra characters. Since the services aren’t all using the same database, the true link for http://shortenerA.com/GJDGXR has no relation to the one for http://shortenerB.com/GJDGXR. If Shortener A loses its database (it happens), then goodbye, all links. Thus, shortening services are opaque and creaky by design: each one is a big-honkin’ private central database.
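To make the arithmetic concrete, here is a minimal Python sketch of how an auto-incrementing row id could become a short link. The names to_base36 and short_link are mine, purely for illustration; I have no idea what any real shortener's code looks like, and the uppercase alphabet is just what matches the example above.

```python
import string

# Digits first, then uppercase letters: 36 symbols in total.
ALPHABET = string.digits + string.ascii_uppercase


def to_base36(n: int) -> str:
    """Format a non-negative integer as an uppercase base-36 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def short_link(row_id: int, host: str = "example.com") -> str:
    """Build a short link from a database row id (hypothetical helper)."""
    return f"http://{host}/{to_base36(row_id)}"


print(short_link(1))            # http://example.com/1
print(short_link(999_999_999))  # http://example.com/GJDGXR
```

Going the other way is one built-in call: int("GJDGXR", 36) gives back 999999999. The key carries no information about the destination, which is why it is worthless without the one database that issued it.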
If the digerati can’t abide secrets, then they should gather up their pitchforks and demand that their favorite shortening and bookmarking services transition away from private, proprietary incrementing keys to an open URL-shortening hashing scheme and a distributed URL repository for backup. How would this work? Well, Netcraft reported that there were 224,749,695 servers on the Internet in March 2009. Let’s just round that up to 250 million and say that each site has an average of 200 URLs, which is 50 billion, or MYWPIWW (just seven characters) in base 36. You could normalize any URL, hash it through a 40-bit hashing function, and format that number as a base-36 string.
All participating open shortening services would have to use the same hashing method and synchronize their URLs with a central repository. Because the key comes from a hash rather than a counter, each service could compute it on its own, defer repository synchronization, and use the repository as an insurance policy. The central repository would be queried only when a hash doesn’t exist locally, such as when a competing service goes dark or offline.
Storage of 5 billion URL records in the repository (assuming each record is 1 KB) would require about 4.65 TB, not quite doable with Amazon SimpleDB just yet. Nevertheless, I can easily imagine two guys hacking up a robust shortened URL repository at this scale over a weekend using tools like Amazon’s EC2, SDB, and SQS, and charging per-use fees to the services for a syncing API. The repository has the added benefit of being a neutral third party that can validate and lengthen links for security.
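Here is a rough Python sketch of the scheme described above, under loudly stated assumptions. The post names no particular hash or normalization rules, so SHA-1 truncated to its top 40 bits and the tiny normalize function are my stand-ins, and the local and repository stores are plain dictionaries standing in for real databases. One wrinkle worth knowing: a full 40-bit value can need up to eight base-36 characters, not seven.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(n: int) -> str:
    """Format a non-negative integer as an uppercase base-36 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def normalize(url: str) -> str:
    """Assumed rules: lowercase scheme and host, default path '/', drop fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def short_key(url: str, bits: int = 40) -> str:
    """Hash the normalized URL and keep the top `bits` bits (SHA-1 is an assumption)."""
    digest = hashlib.sha1(normalize(url).encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)
    return to_base36(value)


def resolve(key: str, local: dict, repository: dict):
    """Check the service's own store first; ask the shared repository only on a miss."""
    url = local.get(key)
    if url is None:
        url = repository.get(key)
        if url is not None:
            local[key] = url  # cache what the repository returned
    return url


def repository_size_tib(records: int, bytes_per_record: int = 1024) -> float:
    """Storage estimate in binary terabytes (2**40 bytes)."""
    return records * bytes_per_record / 2**40


key = short_key("HTTP://Example.COM")
print(key == short_key("http://example.com/#top"))  # True: same URL after normalizing
print(to_base36(50_000_000_000))                    # MYWPIWW
```

Any two services running the same short_key would mint the same key for the same URL without talking to each other, which is what lets synchronization be deferred. The sketch does nothing about two different URLs hashing to the same key; a real repository would have to detect that and decide what to do.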
I have no illusions that this system will ever actually be built, or if it is, that it will be used widely. You’re welcome anyway.

