What do source code management and keeping track of samples in a biochemistry laboratory have in common? Quite a bit, it turns out.
There are many types of samples in the lab – tubes of purified protein or DNA, liquid bacterial cultures, solid cultures, frozen cell lines, etc. Each sample is derived from one or more other samples (for example, DNA is purified from a cell culture) and can give rise to yet more samples. Yet because there is no way that a sample can somehow be a part of its own ancestry, the sample lineage relationships form an instance of a directed acyclic graph (DAG) – the same entity that can be used to describe the revision history of a body of code.
Managing such a collection requires a good way to assign names to new members as they are generated. Most importantly, the names must be unique, and should also be drawn from some well-defined domain (e.g. “six character alphanumeric string” or “40 hex digit string”). Git famously and brilliantly names each revision with a sha1 digest calculated from the revision itself. Having names calculated from content in this way eliminates the need for synchronization in allocating names. My scheme for naming samples in the lab is more traditional in that it does rely on a sort of locking: each investigator has a pre-printed sheet of unique labels for each day, and takes one from this one sheet every time a new sample is generated. Because there is only one sheet per investigator per day, and each sheet is different, there are no name collisions.
Just as git can reconstruct the revision history of any file by walking through the DAG, I can reconstruct the history of any sample in the lab. Let’s test the efficiency of the process by looking at a typical sample:
The sample shown above is a mixture of two purified proteins (it is common for labels to include both ad-hoc descriptive language as a sort of mnemonic together with the unique code to facilitate looking up the exact history). We begin by looking up the parents of the unique code G280909E found on the sample (an O(1) operation of finding the sheet this label was taken from, in a sorted binder, and reading off the parent labels). Then we repeat the process for the sample’s parents, etc, and finally end up with this sample’s lineage:
The two samples farthest back in the ancestry are freezer stocks of bacterial strains, one corresponding to each protein found in the final sample; this is the farthest back this sample can be traced. The intermediate nodes correspond to cell culture and purification steps. Using these labels, it is possible to refer back to any notes or data and unambiguously associate it with this physical sample. So how efficient is this process? Reconstructing the above tree took nine minutes, start to finish (and I should note that this interval includes an interruption caused by a minor lab accident!) Certainly not as fast as reconstructing a file history using git, but still not too bad considering that this is one sample out of hundreds in circulation at any time.