Existing standards for hash-based URN schemes

TL;DR: Before inventing a new URI scheme, see if there's already one in use that does what you need.

My recent nerd forum lurking has given me a sense that there's a wave of interest in content distribution networks that use hashes as identifiers. I suspect this comes in part from the widespread adoption and understanding of the structures underpinning distributed version control systems, along with the realization that it's a good idea to name things in such a way that you can verify that the bytes you get are the ones you asked for (and maybe something about deduplication, or something philosophical about names).

The idea's not at all new, though, and it turns out there are already [de-facto] standards for identifying files by hash and for fetching them from HTTP servers. I am writing this document to spread the word, because wouldn't it be nice if we could all agree on these things so that our systems are interoperable?

Note that the example URNs used below all refer to the string "Hello, world!" (13 bytes, no trailing linefeed). I use the word 'blob' to mean 'byte sequence' (i.e. the contents of a file).

urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ

This URN scheme (along with several variations) is described in an IETF document and is also recognized by some Gnutella clients. The 32-character string following the "urn:sha1:" prefix is the base32-encoded SHA-1 sum of the referenced blob (strictly speaking, the encoding is a slight variation on RFC3548 Base32 that omits padding).
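As a minimal sketch of the encoding described above, the following Python computes a 'urn:sha1:' name using only the standard library. The function name sha1_urn is my own choice for illustration, not part of any specification. A SHA-1 digest is 20 bytes, which happens to encode to exactly 32 base32 characters, so stripping padding is a no-op here; it is kept to make the 'no padding' rule explicit.

```python
import base64
import hashlib


def sha1_urn(blob: bytes) -> str:
    """Return the urn:sha1: name for a byte sequence (illustrative helper)."""
    digest = hashlib.sha1(blob).digest()  # 20 raw bytes
    # Standard Base32 alphabet (A-Z, 2-7); drop any '=' padding for the URN form.
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=")
    return "urn:sha1:" + encoded


print(sha1_urn(b"Hello, world!"))
# urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ
```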

urn:tree:tiger:276TET7NAXG7FVCDQWOENOX4VABJSZ4GBV7QATQ

References a blob by Tiger-Tree hash [mirror]. The 39-character string following the prefix uses the same base32 encoding as the 'urn:sha1:' scheme.

Merkle trees are nice because, assuming you have access to the internal node data, you can verify parts of the file independently. If you're downloading a 10TB file and one bit gets flipped somewhere, you can identify and re-fetch a section of the file containing that bad bit instead of having to re-download the whole thing.
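Here is a rough sketch of that partial-verification idea. It is not a Tiger-Tree implementation: Python's hashlib has no Tiger, so SHA-256 stands in as the hash function, and the resulting values will not match any 'urn:tree:tiger:' name. The 1024-byte leaf size and the 0x00/0x01 prefixes for leaf and internal nodes follow my understanding of the usual tree-hash convention and should be treated as assumptions. All function names are made up for this example.

```python
import hashlib

BLOCK_SIZE = 1024  # assumed leaf size


def _h(data: bytes) -> bytes:
    # Stand-in hash: SHA-256 instead of Tiger, for illustration only.
    return hashlib.sha256(data).digest()


def leaf_hashes(blob: bytes) -> list:
    """Hash each fixed-size block of the blob; an empty blob gets one empty leaf."""
    blocks = [blob[i:i + BLOCK_SIZE] for i in range(0, len(blob), BLOCK_SIZE)] or [b""]
    return [_h(b"\x00" + block) for block in blocks]


def merkle_root(leaves: list) -> bytes:
    """Combine leaf hashes pairwise until a single root hash remains."""
    level = list(leaves)
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0]


def bad_blocks(trusted_leaves: list, received: bytes) -> list:
    """Return the indexes of blocks whose hash differs from the trusted leaf hash."""
    got = leaf_hashes(received)
    return [i for i, (a, b) in enumerate(zip(trusted_leaves, got)) if a != b]


original = bytes(5000)                 # 5 blocks of zero bytes
leaves = leaf_hashes(original)
damaged = bytearray(original)
damaged[2050] ^= 1                     # flip one bit inside block 2
print(bad_blocks(leaves, bytes(damaged)))  # [2]
```

In practice you would first check that the leaf hashes you were given combine to the root hash named in the URN, and only then trust them to pinpoint damaged blocks.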

urn:bitprint:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ.276TET7NAXG7FVCDQWOENOX4VABJSZ4GBV7QATQ

A bitprint is simply a SHA-1 hash and a Tiger-Tree hash concatenated together. In URN form, a period separates the two parts.
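Pulling the two parts back out of a bitprint URN is straightforward. The sketch below (split_bitprint is a hypothetical helper name) splits on the period and checks the lengths mentioned earlier: 32 characters for the SHA-1 part and 39 for the Tiger-Tree part.

```python
def split_bitprint(urn: str) -> tuple:
    """Split a urn:bitprint: name into its urn:sha1: and urn:tree:tiger: parts."""
    prefix = "urn:bitprint:"
    if not urn.startswith(prefix):
        raise ValueError("not a bitprint URN: " + urn)
    sha1_part, sep, tiger_part = urn[len(prefix):].partition(".")
    if not sep or len(sha1_part) != 32 or len(tiger_part) != 39:
        raise ValueError("malformed bitprint URN: " + urn)
    return "urn:sha1:" + sha1_part, "urn:tree:tiger:" + tiger_part


sha1_urn, tiger_urn = split_bitprint(
    "urn:bitprint:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ."
    "276TET7NAXG7FVCDQWOENOX4VABJSZ4GBV7QATQ"
)
print(sha1_urn)   # urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ
print(tiger_urn)  # urn:tree:tiger:276TET7NAXG7FVCDQWOENOX4VABJSZ4GBV7QATQ
```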

Advantages:

Disadvantages:

Overall, I like this scheme and support it in my projects (ContentCouch, PHPN2R), even if they just extract the SHA-1 part and use that.

Comparison with Git object hashes

Although Git uses SHA-1 hashes to reference files, those are not hashes of the file itself. Instead, they are the hash of a small header followed by the file contents. That's why the output of 'sha1sum some-file' doesn't match that of 'git hash-object some-file'. Personally I think it would have been better if Git used the straight SHA-1 sum of the file and stored metadata about how the data is to be interpreted separately (i.e. a bit in the directory entry data structure to indicate if the target is to be interpreted literally or as a directory or symlink or whatever).
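To make the difference concrete, here is a small sketch that computes both values for the same bytes. It assumes the plain blob case, with the header being the word 'blob', a space, the decimal length, and a NUL byte, and no content filters such as line-ending conversion applied. The function names are mine.

```python
import hashlib


def plain_sha1(blob: bytes) -> str:
    """What 'sha1sum some-file' prints: SHA-1 of the bytes alone."""
    return hashlib.sha1(blob).hexdigest()


def git_blob_sha1(blob: bytes) -> str:
    """What 'git hash-object some-file' prints: SHA-1 of header + bytes."""
    header = b"blob " + str(len(blob)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + blob).hexdigest()


data = b"Hello, world!"
print(plain_sha1(data))     # 943a702d06f34599aee1f8da8ef9f7296031d699
print(git_blob_sha1(data))  # a different 40-digit hex string
```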

That said, I do have some ideas for a URI scheme to reference objects by Git-hash.

Combining with RDF

Because sometimes you want to identify things other than byte sequences.

This part is not any sort of pre-existing standard. I came up with it in 2008 because I wanted to build a flexible Git-like system geared towards storing and versioning very large directory structures containing potentially large files (think media collections).

Essentially, the idea is this: if you want to talk about something that's not a byte sequence using a hash-based URN, you (1) create a document about that thing (that's where the RDF comes in), (2) serialize that document, (3) generate a URN for that document, and (4) add some sort of {pre,post,circum}fix to that URN to indicate 'the thing described by'. For that last part I couldn't find any convention already in use. Postfixing the URN of an RDF document with "#something" comes close, but I didn't want to have to give my RDF nodes IDs; I thought of just using "#" but decided I may as well invent a new prefix because its meaning would be more obvious. The prefix I use is "x-rdf-subject:", giving URNs like "x-rdf-subject:urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y.SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY", meaning 'the thing described by urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y.SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY, which presumably is some RDF encoding'. (That particular URN references a directory of music files.)
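Steps (3) and (4) can be sketched in a few lines. This example uses 'urn:sha1:' rather than 'urn:bitprint:' for the document's URN, only because SHA-1 is available in Python's standard library and Tiger is not. The RDF snippet is a made-up placeholder and the helper names are hypothetical.

```python
import base64
import hashlib
from typing import Optional

SUBJECT_PREFIX = "x-rdf-subject:"


def blob_urn(blob: bytes) -> str:
    """urn:sha1: name of a byte sequence (stand-in for a bitprint URN)."""
    digest = hashlib.sha1(blob).digest()
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii").rstrip("=")


def subject_urn(serialized_doc: bytes) -> str:
    """Name 'the thing described by' an already-serialized document."""
    return SUBJECT_PREFIX + blob_urn(serialized_doc)


def described_by(uri: str) -> Optional[str]:
    """Return the URN of the describing document, or None if there is no prefix."""
    if uri.startswith(SUBJECT_PREFIX):
        return uri[len(SUBJECT_PREFIX):]
    return None


# Placeholder document; a real one would be a complete RDF serialization.
doc = b"<Directory><!-- entries would go here --></Directory>"
name = subject_urn(doc)
print(name)                # x-rdf-subject:urn:sha1:... (32 base32 characters)
print(described_by(name))  # the urn:sha1: of the serialized document itself
```

Note that the resulting name depends on the exact bytes of the serialization, so two different serializations of the same description yield two different names for the same thing.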

Why RDF? Because it's a standard and you can represent anything with it in an unambiguous way. Of course you can apply this same idea using formats other than RDF (and certainly other than XML-encoded RDF).

Fetching over HTTP

RFC2169 covers this topic. I've been implementing a section of it, namely the 'GET /uri-res/N2R?some-urn' part (which should return the blob identified by some-urn). I've also come up with some extensions:
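For the basic 'GET /uri-res/N2R?some-urn' request itself (not the extensions), a client can be sketched as below. The point of hash-based names is that the client can check what it received, so the sketch verifies the blob against the URN before returning it. The server address is a placeholder, the function names are mine, and only 'urn:sha1:' names are handled.

```python
import base64
import hashlib
from urllib.request import urlopen


def n2r_url(base_url: str, urn: str) -> str:
    """Build the N2R ('name to resource') URL for a URN on a given server."""
    return base_url.rstrip("/") + "/uri-res/N2R?" + urn


def matches_sha1_urn(blob: bytes, urn: str) -> bool:
    """Check that a blob's SHA-1 matches a urn:sha1: name."""
    digest = hashlib.sha1(blob).digest()
    expected = "urn:sha1:" + base64.b32encode(digest).decode("ascii").rstrip("=")
    return urn == expected


def fetch_blob(base_url: str, urn: str) -> bytes:
    """Fetch a blob by URN and refuse to return it if the hash does not match."""
    with urlopen(n2r_url(base_url, urn)) as response:
        blob = response.read()
    if not matches_sha1_urn(blob, urn):
        raise ValueError("server returned bytes that do not match " + urn)
    return blob


# Hypothetical server; replace with one that actually implements N2R.
print(n2r_url("http://example.org/", "urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ"))
# http://example.org/uri-res/N2R?urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ
```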

Related stuff

Meta

Discuss this article on reddit.

The author of this article is TOGoS; append two zeroes and an "at gmail.com" to that name to e-mail him.