Hash-based URN schemes

Existing standards for hash-based URN schemes

TL;DR: Before inventing a new URI scheme, see if there's already one in use that does what you need.

My recent nerd forum lurking has given me a sense that there's a wave of interest in content distribution networks that use hashes as identifiers. I suspect this comes in part from the widespread adoption and understanding of the structures underpinning distributed version control systems, along with the simultaneous realization that it's a good idea to name things in such a way that you can verify that the bytes you get are the ones you asked for (and maybe something about deduplication, or something philosophical about names).

The idea's not at all new, though, and it turns out there are already de facto standards for identifying files by hash and for fetching them from HTTP servers. I am writing this document to spread the word, because wouldn't it be nice if we could all agree on these things so that our systems are interoperable?

Note that the example URNs used below all refer to the string "Hello, world!" (13 bytes, no trailing linefeed). I use the word 'blob' to mean 'byte sequence' (i.e. the contents of a file).

`urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ`

This URN scheme (along with several variations) is described in an IETF document and is also recognized by some Gnutella clients. The 32-character string following the "urn:sha1:" prefix is the base32-encoded SHA-1 sum of the referenced blob (actually a slight variation on RFC3548 Base32-encoding is used that omits padding).
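As a rough illustration of the encoding described above, here is a minimal Python sketch that builds a `urn:sha1:` name from a blob. The function name `sha1_urn` is my own, not part of any standard; only `hashlib` and `base64` from the standard library are used.

```python
import base64
import hashlib


def sha1_urn(blob: bytes) -> str:
    """Return the urn:sha1: name for a byte sequence."""
    digest = hashlib.sha1(blob).digest()  # 20 raw bytes (160 bits)
    b32 = base64.b32encode(digest).decode("ascii")
    # 160 bits fill exactly 32 base32 characters, so there is no padding
    # to remove here; rstrip is kept to follow the padding-free convention.
    return "urn:sha1:" + b32.rstrip("=")


print(sha1_urn(b"Hello, world!"))
```

Run against the 13-byte example string, this should reproduce the example URN shown above.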

`urn:tree:tiger:276TET7NAXG7FVCDQWOENOX4VABJSZ4GBV7QATQ`

References a blob by Tiger-Tree hash. The 39-character string following the prefix uses the same base32 encoding as the 'urn:sha1:' scheme.

Merkle trees are nice because, assuming you have access to the internal node data, you can verify parts of the file independently. If you're downloading a 10TB file and one bit gets flipped somewhere, you can identify and re-fetch a section of the file containing that bad bit instead of having to re-download the whole thing.
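To make the partial-verification idea concrete, here is a small Python sketch of a hash tree. It is not a real Tiger-Tree implementation: Python's `hashlib` has no Tiger function, so SHA-1 stands in for it, and the resulting hashes will not match any `urn:tree:tiger:` value. The 1024-byte block size and the 0x00/0x01 prefixes for leaf and internal nodes follow my understanding of the THEX convention and should be treated as assumptions. All function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 1024  # assumed segment size (the value THEX uses, as far as I know)


def _h(data: bytes) -> bytes:
    # Stand-in hash: a real Tiger-Tree would use Tiger here, not SHA-1.
    return hashlib.sha1(data).digest()


def leaf_hashes(blob: bytes) -> list:
    """Hash each fixed-size block of the blob (leaf nodes get a 0x00 prefix)."""
    blocks = [blob[i:i + BLOCK_SIZE] for i in range(0, len(blob), BLOCK_SIZE)] or [b""]
    return [_h(b"\x00" + block) for block in blocks]


def root_hash(leaves: list) -> bytes:
    """Combine hashes pairwise (0x01 prefix) until one root remains."""
    level = leaves
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired node is promoted unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def bad_blocks(blob: bytes, trusted_leaves: list) -> list:
    """Return the indexes of blocks whose hash differs from the trusted one."""
    return [i for i, (got, want) in enumerate(zip(leaf_hashes(blob), trusted_leaves))
            if got != want]
```

In practice you would first check that the leaf hashes you were given combine to the root named in the URN, and only then use them to decide which blocks to re-fetch.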

`urn:bitprint:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ.276TET7NAXG7FVCDQWOENOX4VABJSZ4GBV7QATQ`

Bitprints are simply an SHA-1 and TigerTree hash concatenated together. In URN form there's a period between those two parts.

Advantages:

- More bits, so the chance of collision is EVEN LOWER.
- Does your scripting language lack a Tiger hash function? It probably supports SHA-1, so you can at least verify that.
- If you do want to fetch the thing block-by-block (and have a source from which to fetch internal node data), you can do that based off the TigerTree part.

Disadvantages:

- More bits, so your URNs are longer and uglier and take up more storage.

Overall, I like this scheme and support it in my projects (ContentCouch, PHPN2R), even if they just extract the SHA-1 part and use that.
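Extracting and checking just the SHA-1 half of a bitprint might look something like the following Python sketch. The function names are hypothetical, and this is not the code used in ContentCouch or PHPN2R; it only checks the SHA-1 part and ignores the TigerTree part entirely.

```python
import base64
import hashlib

BITPRINT_PREFIX = "urn:bitprint:"


def split_bitprint(urn: str) -> tuple:
    """Split a bitprint URN into (sha1_base32, tigertree_base32)."""
    if not urn.startswith(BITPRINT_PREFIX):
        raise ValueError("not a bitprint URN: %r" % urn)
    sha1_part, sep, tiger_part = urn[len(BITPRINT_PREFIX):].partition(".")
    # 32 base32 characters for SHA-1, 39 for the TigerTree hash.
    if not sep or len(sha1_part) != 32 or len(tiger_part) != 39:
        raise ValueError("malformed bitprint URN: %r" % urn)
    return sha1_part, tiger_part


def sha1_matches(urn: str, blob: bytes) -> bool:
    """Check a blob against only the SHA-1 part of a bitprint URN."""
    sha1_part, _tiger_part = split_bitprint(urn)
    actual = base64.b32encode(hashlib.sha1(blob).digest()).decode("ascii")
    return actual == sha1_part.upper()
```

Called with the example bitprint above and the bytes of "Hello, world!", `sha1_matches` should return `True`.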

Comparison with Git object hashes

Although Git uses SHA-1 hashes to reference files, those are not hashes of the file itself. Instead, they are the hash of a small header followed by the file contents. That's why the output of 'sha1sum some-file' doesn't match that of 'git hash-object some-file'. Personally I think it would have been better if Git used the straight SHA-1 sum of the file and stored metadata about how the data is to be interpreted separately (i.e. a bit in the directory entry data structure to indicate if the target is to be interpreted literally or as a directory or symlink or whatever).
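The difference is easy to see in a few lines of Python. For a plain file, Git's header is the word "blob", a space, the content length in decimal, and a NUL byte. The function names below are mine, for illustration only.

```python
import hashlib


def plain_sha1(data: bytes) -> str:
    """What 'sha1sum some-file' computes: the hash of the bytes alone."""
    return hashlib.sha1(data).hexdigest()


def git_blob_hash(data: bytes) -> str:
    """What 'git hash-object some-file' computes: header plus content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


data = b"Hello, world!"
print(plain_sha1(data))
print(git_blob_hash(data))
```

The two printed values differ even though both are SHA-1 sums of "the same file", which is why a Git object ID can't be used directly as a `urn:sha1:` name.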

That said, I do have some ideas for a URI scheme to reference objects by Git-hash.

Combining with RDF

Because sometimes you want to identify things other than byte sequences.

This part is not any sort of pre-existing standard. I came up with it in 2008 because I wanted to build a flexible Git-like system geared towards storing and versioning very large directory structures containing potentially large files (think media collections).

Essentially, the idea is this: if you want to talk about something that's not a byte sequence using a hash-based URN, you (1) create a document about that thing (that's where the RDF comes in), (2) serialize that document, (3) generate a URN for that document, and (4) add some sort of {pre,post,circum}fix to that URN to indicate 'the thing described by'. For that last part I couldn't find any convention already in use. Postfixing the URN of an RDF document with "#something" comes close, but I didn't want to have to give my RDF nodes IDs. I thought of just using "#", but decided I may as well invent a new prefix because its meaning would be more obvious. The prefix I use is "x-rdf-subject:", giving URNs like "x-rdf-subject:urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y.SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY", meaning 'the thing described by urn:bitprint:B3ZJZ7CSOXEXMZCWFHCBQP4CCSBJET6Y.SDN6FFGJIFX4ODPZ46NCBWNCJQP6APTEX6YRQGY, which presumably is some RDF encoding'. (That particular URN references a directory of music files.)
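Steps (3) and (4) can be sketched in Python as below. This is an illustration under assumptions, not ContentCouch's code: it names the document with `urn:sha1:` rather than `urn:bitprint:` (the standard library has no Tiger hash), the function names are hypothetical, and the document bytes are a made-up placeholder rather than a real RDF serialization.

```python
import base64
import hashlib

SUBJECT_PREFIX = "x-rdf-subject:"


def blob_urn(blob: bytes) -> str:
    """Step 3: name the serialized document by its hash (SHA-1 only here)."""
    digest = hashlib.sha1(blob).digest()
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii").rstrip("=")


def subject_uri(serialized_doc: bytes) -> str:
    """Step 4: prefix the document's URN to mean 'the thing described by'."""
    return SUBJECT_PREFIX + blob_urn(serialized_doc)


def describing_urn(uri: str):
    """Return the URN of the describing document, or None if uri is not a subject URI."""
    if uri.startswith(SUBJECT_PREFIX):
        return uri[len(SUBJECT_PREFIX):]
    return None


# Placeholder bytes standing in for a serialized RDF document (steps 1 and 2).
doc = b"<rdf:RDF>...placeholder...</rdf:RDF>"
print(subject_uri(doc))
```

To resolve such a name, strip the prefix with `describing_urn`, fetch and verify that blob as usual, then parse it to learn about the thing it describes.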

Why RDF? Because it's a standard and you can represent anything with it in an unambiguous way. Of course you can apply this same idea using formats other than RDF (and certainly other than XML-encoded RDF).

Fetching over HTTP

RFC2169 covers this topic. I've been implementing a section of it, namely the 'GET /uri-res/N2R?some-urn' part (which should return the blob identified by some-urn). I've also come up with some extensions:

- GET /uri-res/raw/some-urn[/filename-hint] - this allows one to reference a blob in a way that's a bit more natural for web browsers. A filename hint can be included that will presumably be the default if the user chooses to save the file (by linking to '/uri-res/N2R?' resources, users might end up saving a lot of files called "N2R").
- PUT /uri-res/N2R?some-urn - PUTting to an 'N2R' URL results in either:
  - A 405, 403, or 401 error if for whatever reason you're not allowing PUTs there,
  - A 409 if the URN given in the URL doesn't match the hash of the data uploaded, or
  - A 200 if the URN matches the data and the data has been stored (either already there or uploaded due to this request).
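The URL shapes and the PUT outcomes above can be sketched in Python without any actual server. This is not PHPN2R's implementation: the function names are hypothetical, only the SHA-1 part of a URN is checked, a single `puts_allowed` flag stands in for whatever policy would produce a 405, 403, or 401, and raising `ValueError` for other URN schemes is my own choice rather than anything the extensions specify.

```python
import base64
import hashlib
import urllib.parse


def n2r_path(urn: str) -> str:
    """Path for the standard name-to-resource lookup."""
    return "/uri-res/N2R?" + urn


def raw_path(urn: str, filename_hint: str = None) -> str:
    """Path for the 'raw' extension, with an optional filename hint."""
    path = "/uri-res/raw/" + urn
    if filename_hint:
        path += "/" + urllib.parse.quote(filename_hint)
    return path


def expected_sha1(urn: str) -> str:
    """Pull the base32 SHA-1 out of a urn:sha1: or urn:bitprint: name."""
    for prefix in ("urn:sha1:", "urn:bitprint:"):
        if urn.startswith(prefix):
            return urn[len(prefix):][:32].upper()
    raise ValueError("unsupported URN scheme: %r" % urn)  # assumption, see above


def put_status(urn: str, body: bytes, puts_allowed: bool = True) -> int:
    """Decide the HTTP status for a PUT to /uri-res/N2R?urn."""
    if not puts_allowed:
        return 405  # could equally be 403 or 401, depending on the reason
    actual = base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")
    if actual != expected_sha1(urn):
        return 409  # URN in the URL doesn't match the uploaded data
    # A real server would store the blob here if it isn't already present.
    return 200
```

For example, `put_status("urn:sha1:SQ5HALIG6NCZTLXB7DNI56PXFFQDDVUZ", b"Hello, world!")` gives 200, while the same URN with any other body gives 409.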

Related stuff

- The Wikipedia page on the Magnet URI scheme - references various hash-based URN schemes for use in the 'xt' part of magnet: URIs.
- Freenet - It's been using hashes (and Merkle trees!) to identify files since last century.
- PHPN2R and TPFetcher - PHP programs to serve and fetch N2R resources.
- ContentCouch - the program I wrote to back up my music/video/photo collections. The code's not in great shape, but the data structures and serialization formats are solid. I re-implemented parts of it (checkout is missing) in ContentCouch3.
- An image collection served entirely using /uri-res/raw links.
- hash-uri - a more general hash-based URI scheme.
- RFC6920 - Naming Things with Hashes - Another RFC for hash-based URNs which happens to have a title very similar to my tech talk (and which I stumbled across only because I had been googling for my own tech talk with those words).
