Archive for the ‘Sample-naming algorithm’ Category


How a changeset is like a vial of bacteria

In biology, Sample-naming algorithm, source code management on December 20, 2011 by gcbenison

What do source code management and keeping track of samples in a biochemistry laboratory have in common? Quite a bit, it turns out.

There are many types of samples in the lab – tubes of purified protein or DNA, liquid bacterial cultures, solid cultures, frozen cell lines, etc. Each sample is derived from one or more other samples (for example, DNA is purified from a cell culture) and can give rise to yet more samples. Yet because there is no way that a sample can somehow be a part of its own ancestry, the sample lineage relationships form an instance of a directed acyclic graph (DAG) – the same entity that can be used to describe the revision history of a body of code.

Managing such a collection requires a good way to assign names to new members as they are generated. Most importantly, the names must be unique, and should also be drawn from some well-defined domain (e.g. “six character alphanumeric string” or “40 hex digit string”). Git famously and brilliantly names each revision with a SHA-1 digest calculated from the revision itself. Having names calculated from content in this way eliminates the need for synchronization in allocating names. My scheme for naming samples in the lab is more traditional in that it does rely on a sort of locking: each investigator has a pre-printed sheet of unique labels for each day, and takes the next label from that sheet every time a new sample is generated. Because there is only one sheet per investigator per day, and each sheet is different, there are no name collisions.
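To make the idea of a content-derived name concrete, here is a minimal shell sketch of how Git names a file's contents (a "blob"): the SHA-1 of a short header followed by the bytes themselves. Revisions (commits) are named the same way, from the commit object's own text. The function name git_blob_name is made up for illustration, and the sketch assumes sha1sum from GNU coreutils is installed.

```shell
# Print the name Git would give a blob whose content arrives on stdin.
# git_blob_name is a hypothetical helper; sha1sum (GNU coreutils) is assumed.
git_blob_name() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"                           # buffer stdin so it can be measured
  size=$(wc -c < "$tmp" | tr -d ' ')     # content length in bytes
  # Git hashes the header "blob <size>" plus a NUL byte, then the content.
  { printf 'blob %s\0' "$size"; cat "$tmp"; } | sha1sum | cut -d ' ' -f 1
  rm -f "$tmp"
}

printf 'hello\n' | git_blob_name
```

Two people hashing the same bytes get the same 40-hex-digit name without consulting each other, which is exactly the coordination the pre-printed label sheets provide by other means.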

Just as git can reconstruct the revision history of any file by walking through the DAG, I can reconstruct the history of any sample in the lab. Let’s test the efficiency of the process by looking at a typical sample:

A sample tube, displaying unique name

The sample shown above is a mixture of two purified proteins (it is common for labels to include both ad-hoc descriptive language as a sort of mnemonic together with the unique code to facilitate looking up the exact history). We begin by looking up the parents of the unique code G280909E found on the sample (an O(1) operation of finding the sheet this label was taken from, in a sorted binder, and reading off the parent labels). Then we repeat the process for the sample’s parents, and so on, and finally end up with this sample’s lineage:

The two samples farthest back in the ancestry are freezer stocks of bacterial strains, one corresponding to each protein found in the final sample; this is the farthest back this sample can be traced. The intermediate nodes correspond to cell culture and purification steps. Using these labels, it is possible to refer back to any notes or data and unambiguously associate them with this physical sample. So how efficient is this process? Reconstructing the above tree took nine minutes, start to finish (an interval that includes an interruption caused by a minor lab accident!). Certainly not as fast as reconstructing a file history using git, but still not too bad considering that this is one sample out of hundreds in circulation at any time.
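The lookup-and-repeat procedure described above amounts to a depth-first walk of the lineage graph. The sketch below shows one way it might look if the binder were transcribed into a table of "child parent" pairs. Only the label G280909E comes from the sample shown here; every other label, and the function names parents_of and print_lineage, are invented for illustration.

```shell
# Hypothetical lineage table, one "child parent" pair per line.
# Only G280909E is a real label; the rest are made up for this sketch.
lineage='G280909E G250909A
G280909E G260909C
G250909A G220909B
G260909C G230909A
G220909B G010909A
G230909A G010909B'

# Print the parents recorded for the label given as $1.
parents_of() {
  printf '%s\n' "$lineage" | awk -v child="$1" '$1 == child { print $2 }'
}

# Print $1 and then, recursively, all of its ancestors,
# indenting by two spaces per generation ($2 is the current indent).
print_lineage() {
  printf '%s%s\n' "$2" "$1"
  for parent in $(parents_of "$1"); do
    print_lineage "$parent" "$2  "
  done
}

print_lineage G280909E ""
```

In this made-up table the walk ends at two labels with no recorded parents, standing in for the two freezer stocks. If two branches shared an ancestor, this simple walk would print that ancestor once per path that reaches it.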



How I name my samples

In Sample-naming algorithm on June 12, 2011 by gcbenison

In programming and in other things, sometimes half your job comes down to picking good names for things.

Working in a biochemistry lab means constantly generating new samples, sometimes dozens per day. Some of these samples become the source of multiple data points. Some samples end up being archived for months or even years. Projects are interleaved in time. All of this requires that every researcher adopt a system for attaching names to samples. There is no choice here – you have a system, even if you didn’t consciously adopt it!

I want my sample names to meet these four criteria:

  • Unique – the name must be obviously distinct from thousands of others, and remain unique over time
  • Terse – long, descriptive names won’t fit on 1 cm tubes
  • Typical – a sample identifier should stand out in my notes as such, amid all the other writing
  • Easily generated – at the time of sample generation I want to spend zero effort and time on picking the name and ensuring that it meets the other three criteria

I solve the problem by choosing sample identifiers from a namespace composed of: the letter ‘G’ + the date in ‘DDMMYY’ format + one letter [A-Z]. For example, G120611A, G120611B, etc. This provides 26 unique sample names per day. I print them out on daily sample manifests with one unique name per line with space to pencil in a more verbose description:

Daily sample manifest
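As a small illustration of the namespace, the sketch below lists one day's identifiers and checks whether a string has the right shape. The function names names_for_day and is_sample_name are hypothetical, and the date call assumes GNU date (for the --date option).

```shell
# Print one identifier per letter A-Z for the day given as $1:
# "G" + DDMMYY + letter. $1 is any date string GNU date accepts.
names_for_day() {
  stamp=$(date --date "$1" +%d%m%y) || return 1
  for letter in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
    printf 'G%s%s\n' "$stamp" "$letter"
  done
}

# Succeed if $1 looks like an identifier: G, six digits, one capital letter.
is_sample_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^G[0-9]{6}[A-Z]$'
}

names_for_day 2011-06-12 | head -n 3
is_sample_name G120611A && echo "G120611A has the expected shape"
```

The fixed shape is what lets an identifier stand out in notes: anything matching the pattern is a sample name, and nothing else is. The shape check does not verify that the six digits form a real calendar date.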

To assign a name to a sample, I just pick the next empty line on the manifest, and I have confidence that the chosen name is unique and will remain so. Keeping these manifests in a folder provides a succinct running record of all samples I have generated. The names are terse (8 characters) and easily fit on all common lab containers. (I can, and usually do, include descriptive information on containers in addition to the unique identifier.) I include an extra copy of the identifier on each line so I can cut it out and paste it to the physical sample. Of course it can also be written with a Sharpie:

Eppie tube with unique label

Unique names on Falcon tubes

Having the unique identifier on the physical sample makes it easy to look up information about it in my notebook, and to be sure that the notes refer to that exact sample:

Notebook page

Here is a shell script that will generate sample manifests for the next 30 days:

#!/bin/sh
# Writes PostScript to standard output.
# Generate pages with 26 unique sample names per page, one name per line,
# for the next 30 days.
# GCB 12jun2011


# header section
cat <<EOF
/title-font /Helvetica findfont 13 scalefont def
/mini-font  /Helvetica findfont 7  scalefont def
/std-font   /Helvetica findfont 10 scalefont def

/inch {72 mul} def

/page-height 9.5 inch def
/page-width 7.0 inch def
/n-rows 26 def
/v-delta page-height 20 sub n-rows div def

/title {
  1.1 inch dup translate

  0 page-height 0.5 inch sub translate
  0 0 moveto
  title-font setfont show

  2 inch 0 moveto
  std-font setfont
  (Mellies Lab, Reed College; unique sample labels) show
} def

/label {
  /txt exch def
  0 v-delta -1 mul translate

  0 0 moveto
  std-font setfont
  txt show

  0 -5 moveto
  0.6 setlinewidth
  page-width 0 rlineto stroke

  0.3 setgray
  (G666666A) stringwidth pop 10 add -5 moveto
  0 20 rlineto stroke

  mini-font setfont
  page-width (G666666A) stringwidth pop 2 mul sub 0 moveto
  txt show
  10 0 rmoveto
  txt show

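} def
EOF

# Number of days ahead to cover (the header comment says 30).
n_days=30

# Emit one page per day: a title, then one label row per letter A-Z.
# delta runs from -4 (four days ago) through n_days.
# Requires seq and GNU date (for the --date option).
for delta in `seq -4 $n_days`; do
  date "+(%a, %b %d %Y) title" --date "+$delta days"
  for idx in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
    date "+(G%d%m%y$idx) label" --date "+$delta days"
  done
  echo "showpage"
done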


A high-risk, high-return investment

In Sample-naming algorithm on November 8, 2009 by gcbenison

A colleague of mine – like me, a protein scientist – recently remarked that he had “spent all week making protein that fits in a tube half the size of your pinky finger”.

It’s an accurate description: even today, with all of the tools available, it takes about one person-week of skilled labor to make a sample of pure protein starting from scratch. The process involves a series of steps, each of which must succeed; otherwise, at the end of the week, instead of a pinky-sized sample of protein you will have exactly nothing to show for your efforts.

In this way bench work is different from anything else I do as a scientist – such as writing manuscripts, reading, or debugging code – where a couple of hours invested is almost certain to result in something tangible being accomplished.  Only in bench work do I need to make a large investment of time not knowing if or when it will pay off.  The return on time invested in protein purification can be very high, though, especially for a novel protein.  Every protein is different, and know-how in working with a particular one can be a lab’s competitive edge, providing a foundation for a successful research program lasting years.  Successful collaborations can be built entirely on a lab’s ability to produce a desired protein, which traces back to that time initially invested at the lab bench.

So given the possibility of high return, how can one manage the risk associated with bench work?  How can I ever summon the courage to start a new protein purification, knowing that I may be about to embark on a colossal waste of time?

The answer, for me, is this: “The protein is not the product; the process is the product.”  Even in the course of a failed protein purification, every step generates a wealth of information about how that protein behaves under certain conditions.  If you capture that valuable information,  so that your next attempt can be better, your time will not have been wasted.  If, on the other hand, you focus only on the end product and let potential lessons fade away,  you have wasted your time indeed.

So how can you capture the lessons you learn during a purification? It is not enough just to believe in the idea – you need a method. In an academic lab, we usually don’t have the budget or the formality to use a commercial LIMS. Instead we have something more ad-hoc, often no more than the guideline to “write things down in your lab notebook”. That’s pretty vague, leaving a lot of latitude for people to develop their own systems for keeping track of samples. In coming posts, I’ll share my system, and why I chose it.