Monthly Archives: April 2009

Open Geographic Tagging

I had an idea for making open geographic tags over the weekend so I hacked out a quick proof of concept in JavaScript. If you hate reading, you can skip to my demo here and play with it now.

You may have read the recent noise about URL shortening services being “the herpes of the web“. Please. I think this comes down to tagging being a closed system, but it’s not easy to fix and I doubt these kinds of things will ever go away.

It got me thinking, though: links to geographic information could benefit from shortening, and selectively loosing information could even be an advantage. Latitude and longitude is expressed as a pair of numbers, traditionally divided up by degrees, minutes, and seconds, but digitally expressed as floating numbers between -90/90 and -180/180 respectively. For example, a beach near my home is at 33.54240,-117.78559. For example, say I wanted to tell my friends that I’m at the beach, and my Twitter client could access my GPS and append my location to the message (yes I know about Twinkle). That’s 19 characters out of 140 and it’s very public. Also it’s a single point, not a whole area, and how do I search for other people near me?

My idea is to create a new kind of geographic code that specifies areas, not points, and is constructed in a text-search friendly way. Here’s how it works. First divide the world into a grid of 36 x 36, or 1,296 squares. Assign each row a number or letter using 0-9 and A-Z, that’s why 36. If we specify one of these grids by coordinates, such as “o6″ for example, that would specify a rather large part of the earth, as you can see below. If we were then to reference a point using a further 36 x 36 subdivision of this grid, we could specify an area even further, such as “o6p7″, which would be smaller still, and so on and so forth, all the way down to “o6p7iy6z”, the  rough equivalent of the 19-character coordinates above in only 8 characters.

The more specific you want to be about your location, the more subdivisions your code would include, although using base 36, four levels is about the size of a house, three levels is a few blocks, two levels is about an hour’s drive, and one level is a day’s drive. That sizing feels right to me, but perhaps if this ever gets used people will standardize on different subdivisions, such as base 16 or base 24. Anyway, if you’re sharing your location but want to protect your privacy, leaving your location general and loosing a little specificity is a good thing. The subdivisions function like decimal places in that case. “o6″ means somewhere in SoCal, while “o6p7iy6z” means a certain spot by the volleyball nets on Main Beach in Laguna Beach, California.

When looking for proximity, you can see at a glance if a code is near you. In my case, I wouldn’t bother with things that don’t start with “o6p7″, although I might look at “06p8″ or “o6q7″, which are adjacent. A text search engine could find nearby items tags a wildcard search pattern (“06p7*”).

One other major benefit of a protocol like this is that it would be open. It’s not perfect; you may live close to the boundary of a big grid, and there’s an up-front hassle in figuring out exactly what grid you’re presently in unless your device can calculate your geo tag for you. Once in place, no single company would be the gatekeeper for the information that makes the tags meaningful, as in the case of URL shorteners. A unified location tag protocol could open up many interesting mashups between location aware devices and services, not just Twitter et al.

If someone’s already done something like this, then I wasn’t able to find it and I apologize, please let me know. If you want to fix it up and build upon this idea, by all means go for it but please link back here. Please view the source of the HTML demo file for implementation details and send any errors or suggestions. I think it’s not a perfect square because of the distortion from the map projection, but I could have flubbed something too.

Open & Distributed URL Shortening

Update 8/20: tr.im has since threatened to shut down (then not), and bit.ly has proposed the FDIC of shortners. I look back on my post below and concede that the hashing thing will never happen, but the central backup surely should occur, and 301works.org is the best chance it will come about. Any service that claims protecting your “privacy” prevents them from backing up their link archive with a 3rd party is lying or misunderstands the purpose of a central repository and shouldn’t be trusted with your links anyway.

The latest brouhaha over URL shorteners is overblown. Are they evil or unnecessary? Who cares? They’re not going away. Proclaiming the death of “unnecessary” institutions is a tired cliché. I’m all for self-documenting pretty URLs like the next guy, but come on.

I feel that the core argument boils down to discomfort with the opaqueness of the the whole thing. Where does a shortened link go? You don’t know until you’re there. What kind of permanence does the shortening service itself have? This kind of worry is an echo of the greater distrustful zeitgeist though – if AIG and GM can teeter on the edge, what hope can we have for TinyURL? That’s probably how Philistine foot soldiers felt when Goliath fell.

It appears that many shorteners use an auto-incrementing database key to store link references. Thus the first link is http://example.com/1, and the second is http://example.com/2, and by using all letters of the alphabet (using base 36) then link number 999,999,999 will only be http://example.com/GJDGXR, six extra characters. Since they’re not all using the same database, then the true link for http://shortenerA.com/GJDGXR does not relate to http://shortenerB.com/GJDGXR. If Shortner A looses their database (it happens) then goodbye all links. Thus, shortening services are opaque and creaky by design, namely big-honkin’ private central database.

If the digerati can’t abide secrets, then they should gather up their pitchforks and demand that their favorite shortening and bookmarking services transition away from private, proprietary incrementing keys to an open URL shortening hashing scheme and distributed URL repository for backup. How would this work? Well, Netcraft reported that there were 224,749,695 servers on the Internet in March 2009. Let’s just round that up to 250 million and say that each site has an average of 200 URLs, which is 50 billion, or MYWPIWW — just seven characters — in base-36. You could normalize and hash any URL through a 40-bit hashing function and format that number to a base-36 string. All participating open shortening services would have to use the same hashing method and synchronize their URLs with a central repository. By using a hash function, each service could defer repository synchronization and use it as an insurance policy. The central repository would be queried only when a hash doesn’t exist locally such as when a competing service goes dark or offline. Storage of 5 billion URL records in the repository (assuming each record is 1kb) would require 4.65 TB of storage, not quite doable with Amazon SimpleDB just yet. Nevertheless, I can easily imagine two guys hacking up a robust shortened URL repository at this scale over a weekend using tools like Amazon’s EC2, SDB, and SQS and charging per-use fees to the services for a syncing API. The repository has the added benefit of being a neutral third party that can validate and lengthen links for security.

I have no illusions that this system will ever actually be built, or if it is, that it will be used widely. You’re welcome anyway.

Deleting Duplicate Records in MySQL

I am writing this post to remind myself of a cool trick that I rediscovered today. When you are scrubbing a database table, there is the common problem of removing duplicate data. In my case, I had a table of ZIP codes, city names, and state codes, along with the latitude and longitude values. The problem was that there were a few thousand records where the lat/lon data was slightly off leading to multiple duplicate zip code rows with slightly different data. I have seen solutions for other databases that require a temporary ID field and self joins. It turns out that for MySQL the solution is much simpler. If you use the ALTER IGNORE TABLE command to add a unique index, then your duplicate rows will be silently deleted. Here was the command I used:
ALTER IGNORE TABLE `zip_codes`
ADD UNIQUE INDEX dupes (`region_code`, `city`, `zipcode`);

Finally, just drop the temporary index and you’re done. Here’s the official word from the manual:

IGNORE is a MySQL extension to standard SQL. It controls how ALTER TABLE works if there are duplicates on unique keys in the new table or if warnings occur when strict mode is enabled. If IGNORE is not specified, the copy is aborted and rolled back if duplicate-key errors occur. If IGNORE is specified, only the first row is used of rows with duplicates on a unique key, The other conflicting rows are deleted. Incorrect values are truncated to the closest matching acceptable value.