title: Simplifying the blob downloader
date: 2013-09-20
tef:content-type: text/html

So there's this blob downloader project I've had on my mind for several years. The idea is to take a list of SHA-1 or Bitprint URNs and download them from wherever they can be found. ContentCouch can do this, but it takes forever because it only checks one repository at a time and keeps using the last one it successfully downloaded from, even if that happens to be over a really slow link. I also wrote that program when I had been reading about NetKernel, so it's full of overly componentized and resourcified code that seemed like a good idea at the time but turned out to be a completely unnecessary level of abstraction, one that makes the program pretty much impossible to extend or debug. So this project has been calling for a fresh start.

The new version needs to take a stream of URNs on standard input (or from a file), distribute the requests evenly across multiple servers as soon as possible, and recover from 404s by trying again on a different server. If a command-line flag indicates that referenced blobs should be recursively downloaded, it also needs to scan completed downloads (or possibly even ones in progress) for URNs to add back to the download queue. And somehow it has to know when all the queues are empty so the program can exit.

I did a lot of brainstorming about how the program would work. I drew a lot of diagrams of components with queues running between them, with explanations of what would be passed through. I made an attempt to write the thing in Node.js, which mysteriously didn't work (I have no idea how to structure or debug callback-laden Node code). I made some more diagrams and started writing Java classes that took queues as constructor arguments, ran into queueing problems (things like: process A may block while pushing into queue Q, but process B must not), got frustrated with how complicated everything was getting, and gave up.

In these designs I always tried to either:

Then the other night I dreamt up what seemed like a simpler solution while eating cereal. Have a bunch of threads (equal to the maximum number of connections you want to make) each take a URN to download from a queue, attempt to download it until it completes or all remote repositories have been tried, add (non-blockingly) any referenced URNs back to the queue if recursion is enabled, and then move on to the next URN.
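A minimal sketch of that per-thread loop, with hypothetical `Repository` and `findReferencedUrns` stand-ins for the real classes (none of these names are from the actual program):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a remote repository.
interface Repository {
    byte[] get(String urn); // returns null if the repository doesn't have the blob
}

class CerealDownloader {
    // One worker's loop: take a URN, try each repository until one has
    // the blob, then (if recursion is on) offer any referenced URNs
    // back onto the queue without blocking.
    static void workerLoop(BlockingQueue<String> queue, List<Repository> repos,
                           boolean recurse, Map<String, byte[]> results) {
        String urn;
        // A real worker would block on take(); poll() lets this sketch
        // exit once the queue is drained.
        while ((urn = queue.poll()) != null) {
            byte[] blob = null;
            for (Repository repo : repos) { // a miss just means: try the next repo
                blob = repo.get(urn);
                if (blob != null) break;
            }
            if (blob == null) continue;     // not found anywhere; give up on this URN
            results.put(urn, blob);
            if (recurse) {
                for (String ref : findReferencedUrns(blob)) {
                    queue.offer(ref);       // non-blocking, so workers can't deadlock
                }
            }
        }
    }

    // Stand-in for scanning a completed download for embedded URNs.
    static List<String> findReferencedUrns(byte[] blob) {
        return List.of();
    }
}
```

Running a few of these workers against the same queue gives you the even distribution and the retry-on-another-server behavior almost for free.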

I blew away my non-working component-based code and implemented a downloader using the cereal model in about 4 hours. A RepositorySet object is in charge of locking and releasing connection slots (these control access to the remote repositories; there are about two slots per repository), and the download threads blockingly request slots to download from, passing in a list of repositories that have already been determined not to contain the blob (this request blocks only if all untried slots are in use). I wrote a custom queue-like class with three methods (put, add, and take) that lets one enqueue a URN blockingly (waiting until the internal list is under a certain size; this is used by the main thread) or non-blockingly (adding it to the list unconditionally and returning immediately; this is used by the download threads to recurse, because you don't want the download threads to deadlock), without bothering to implement the Queue collection interface. The whole thing feels pretty self-contained and sits nicely in a single 441-line Java source file.
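The three-method queue might look something like this (a sketch; the method names put, add, and take are from the text, but the class name and internals are my guesses):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the custom queue-like class: put blocks while the list is
// at its soft limit (used by the main thread feeding URNs in), add
// enqueues unconditionally (used by download threads when recursing,
// so they can never deadlock on their own queue), and take blocks
// until an item is available.
class UrnQueue {
    private final Deque<String> items = new ArrayDeque<>();
    private final int softLimit;

    UrnQueue(int softLimit) { this.softLimit = softLimit; }

    // Blocking enqueue: waits until the queue is under its soft limit.
    public synchronized void put(String urn) throws InterruptedException {
        while (items.size() >= softLimit) wait();
        items.addLast(urn);
        notifyAll();
    }

    // Non-blocking enqueue: always succeeds immediately, even past the limit.
    public synchronized void add(String urn) {
        items.addLast(urn);
        notifyAll();
    }

    // Blocking dequeue: waits until an item is available.
    public synchronized String take() throws InterruptedException {
        while (items.isEmpty()) wait();
        String urn = items.removeFirst();
        notifyAll();
        return urn;
    }

    public synchronized int size() { return items.size(); }
}
```

Because add ignores the soft limit, a download thread that discovers referenced URNs can always push them and keep going, while the main thread's put applies backpressure against an unbounded stdin stream.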

I guess the lesson here is to not get so obsessed with cutesy design patterns ('everything should be controlled by an event loop', 'everything should be a component and talk via queues', 'anything that acts like a queue should implement Queue') that you overlook simple, effective solutions.