title: Simplifying the blob downloader
date: 2013-09-20

Simplifying the blob downloader - 2013-09-20 - Entry 4 - TOGoS's Project Log

So there's this blob downloader project I've had on my mind for
several years.  The idea is to take a list of SHA-1 or Bitprint URNs
and download them from wherever they can be found.  ContentCouch can
do this, but it takes forever because it only checks one repository at
a time and keeps using the last one that it had success downloading
from, even if that happens to be over a really slow link.  I also
wrote that program back when I had been reading about Netkernel, so
it's full of overly componentized and resourcified code that seemed
like a good idea at the time but turned out to be a completely
unnecessary layer of abstraction, one that makes the program pretty
much impossible to extend or debug.  So this project has been calling
for a fresh start.

The new version needs to be able to take a stream of URNs on standard
input (or from a file) and distribute the requests evenly across
multiple servers as soon as possible, recover from 404s by trying
again on a different server, and, given a command-line flag
indicating that referenced blobs should be recursively downloaded,
scan completed downloads (or possibly even ones in progress) for URNs
to add back to the download queue.  Somehow it has to know when all
the queues are empty so the program can exit.
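The recursive part hinges on being able to spot URNs in downloaded data.  A minimal sketch of that scanning step might look like the following; the class name, the exact URN grammar accepted, and the base32 lengths (32 characters for SHA-1, plus a 39-character tiger-tree hash for Bitprint) are my assumptions, not the actual downloader's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull SHA-1/Bitprint URNs out of arbitrary text,
// e.g. when scanning a completed download for more blobs to fetch.
class UrnScanner {
    // SHA-1 URNs carry 32 base32 chars; Bitprint adds a dot and a
    // 39-char tiger-tree hash.  (Assumed syntax, for illustration.)
    static final Pattern URN = Pattern.compile(
        "urn:(?:sha1:[A-Z2-7]{32}|bitprint:[A-Z2-7]{32}\\.[A-Z2-7]{39})");

    static List<String> scan(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = URN.matcher(text);
        while (m.find()) found.add(m.group());
        return found;
    }
}
```

Each match found this way would just be pushed back onto the download queue.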

I did a lot of brainstorming on how the program would work.  I drew a
lot of diagrams of components with queues running between them and
explanations of what would be passed through.  I made an attempt to
write the thing in Node.js which mysteriously didn't work (I have no
idea how to structure or debug callback-laden Node code).  I made some
more diagrams and started writing Java classes that took queues as
constructor arguments and then ran into queueing problems (things
like: process A can block while pushing into queue Q, but process B
cannot) and got frustrated with how complicated things were getting
and gave up.

In these designs I always tried to either:
- Have a single thread handle all events (the Node approach), or
- Have a bunch of components that talk only via queues.

Then the other night I dreamt up what seemed like a simpler solution
while eating cereal.  Have a bunch of threads (equal to the maximum
number of connections you want to make) each take a URN to download
from a queue, attempt to download it until completed or all remote
repositories have been tried, add (non-blockingly) any referenced
URNs back to the queue if recursion is enabled, then move on to the
next URN.
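The per-thread loop described above might be sketched roughly like this.  The `Repository` and `DownloadWorker` names are illustrative, not the real classes, and the poll-timeout exit condition is a stand-in for the harder "all queues are empty" check mentioned earlier:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a remote repository: returns the URNs
// referenced by the downloaded blob, or null if the blob wasn't found
// (a 404), in which case the worker tries the next repository.
interface Repository {
    List<String> download(String urn);
}

class DownloadWorker implements Runnable {
    final BlockingQueue<String> queue;
    final List<Repository> repos;
    final boolean recurse;

    DownloadWorker(BlockingQueue<String> queue, List<Repository> repos,
                   boolean recurse) {
        this.queue = queue; this.repos = repos; this.recurse = recurse;
    }

    public void run() {
        try {
            String urn;
            // Simplification: give up when the queue stays empty briefly;
            // the real program needs a proper "everything drained" check.
            while ((urn = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                for (Repository repo : repos) {
                    List<String> refs = repo.download(urn);
                    if (refs == null) continue;      // 404; try the next one
                    if (recurse) queue.addAll(refs); // non-blocking re-enqueue
                    break;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Running one of these per connection slot gives the whole "cereal model": no central event loop, just threads pulling from a shared queue.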

I blew away my non-working component-based code and implemented a
downloader using the cereal model in about 4 hours.  A RepositorySet
object is in charge of locking and releasing connection slots (these
control access to remote repositories, about 2 per repository), and the
download threads blockingly request slots to download from, passing in
a list of repositories that have already been determined to not
contain the blob (this request blocks only if all untried slots are in
use).  I wrote a custom queue-like class with 3 methods (put, add, and
take) that allows one to enqueue a URN blockingly (waits until the
internal list is under a certain size; this is used by the main
thread) or non-blockingly (adds it to the list unconditionally and
returns immediately; this is used by the download threads to recurse
because you don't want the download threads to deadlock) without
bothering to implement the Queue collection interface.  The whole
thing feels pretty self-contained and sits nicely in a single 441-line
Java source file.
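A reconstruction of that 3-method queue, under my own assumptions about its name and internals (the post doesn't give them), might look like this: `put` blocks the producer while the backing list is over a soft size limit, `add` enqueues unconditionally so recursing download threads can never deadlock on their own queue, and `take` blocks until an item is available.

```java
import java.util.ArrayDeque;

// Hypothetical reconstruction of the custom queue-like class; the real
// class's name, size limit, and synchronization details are not given
// in the post.
class UrnQueue<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final int softLimit;

    UrnQueue(int softLimit) { this.softLimit = softLimit; }

    // Blocking enqueue: used by the main thread feeding URNs from stdin,
    // so it naturally throttles when the workers fall behind.
    public synchronized void put(T item) throws InterruptedException {
        while (items.size() >= softLimit) wait();
        items.addLast(item);
        notifyAll();
    }

    // Non-blocking enqueue: used by download threads when recursing;
    // ignores the soft limit so a worker can never block itself.
    public synchronized void add(T item) {
        items.addLast(item);
        notifyAll();
    }

    // Blocking dequeue: workers wait here for the next URN.
    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) wait();
        T item = items.removeFirst();
        notifyAll();
        return item;
    }
}
```

Because the limit is only enforced on `put`, the internal list can grow past it during recursion, which is exactly the point: it's a throttle on input, not a hard capacity.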

I guess the lesson here is to not get so obsessed with cutesy design
patterns ('everything should be controlled by an event loop',
'everything should be a component and talk via queues', 'anything that
acts like a queue should implement Queue') that you overlook simple,
effective solutions.