Archive for the ‘biology’ Category


DNA sequence annotation: a graph coloring problem

In algorithms,biology,functional programming,haskell,science on June 18, 2012 by gcbenison Tagged: , , ,

In this post I will explore methods for annotating DNA sequences with translated ORFs, like so:

      1870      1880      1890      1900      1910      1920

Creating such annotated diagrams is an application of graph coloring.  My code to produce them is  free software.


DNA is a lot like RAM, except instead of bits coming together in groups of eight to form bytes, DNA bases (of which there are four – A, C, T, and G) come together in groups of three to form codons.  The sequence of codons in turn determines the sequence of amino acids that make up a protein molecule.  The function that maps codons to amino acids (which is notably neither one-to-one nor onto) is called the genetic code, and its deciphering was a major intellectual achievement of the twentieth century.

The analogy between bits and DNA bases breaks down in one important way: the notion of alignment.  The one thing that all computer vendors seem to agree on is that a byte consists of eight bits, that each bit clearly “belongs to” exactly one byte, and that it has a defined place within that byte.  No such thing is true of DNA.  To the molecular machinery that reads it, DNA is just a stream of single bases.  The special sequence of three bases ‘A’, ‘T’, ‘G’ – the “start codon” – tells the DNA-reading machinery to start grouping succeeding triples of bases into codons.  The sequence of codons initiated by the start codon – and then terminated by a sequence known as a “stop codon” – is called an open reading frame (ORF).  Thus each base cannot be inherently the “least significant” or “most significant”, the way that a bit can.  Rather, the significance of a base within a codon is determined entirely by its context within the surrounding frame defined by start and stop codons.  The same base can even simultaneously form part of multiple codons, with a different meaning in each, if it occurs within overlapping reading frames:

        10        20        30        40        50      

In the example above, the ‘C’ base at position 16 occurs in two different contexts: as the final base in the ‘Cys’ codon within the ORF that starts at position 8, and as the first base in the ‘Pro’ codon within the ORF that starts at position 13.

ORF annotation as an application of graph coloring

A display such as the one in the example above, where a DNA sequence is displayed annotated with the ORFs it contains and their translations, is very useful in planning genetic engineering experiments becuase it allows you to edit a DNA sequence and visualize the effects this will have on the encoded proteins.  ORFs that overlap must be displayed on separate lines.  So a program that generates such displays must contain an algorithm to decide on which line to print each ORF.  This deceptively simple-looking task is special case of the graph coloring problem, and another example of how abstract concepts from graph theory tend to show up in everyday contexts.  Let each ORF be the vertex in a graph which contains an edge between any two overlapping ORF’s.  The problem of assigning ORFs to display lines is equivalent to the problem of assigning a color to each node such that no two connected nodes share the same color.   The graph coloring problem is also the key to register allocation, optimal bus scheduling, and many other things.  And it is hard to do well. It is trivially easy to find a coloring of a graph – just assign a different color to each node.  (In the DNA application, just put each ORF on its own line).  It is much harder to find a minimal coloring – a coloring using the fewest possible distinct colors, which in the DNA case corresponds to using the fewest possible display lines to contain all the ORFs.

Comparing two graph coloring algorithms

Somewhat surprisingly, there is no algorithm to solve the general minimal graph coloring problem efficiently (i.e. with polynomial time complexity).  There are many heuristics though that approximate the minimal solution.  For example, the greedy coloring algorithm can be stated: “for each vertex v_i in V, assign the lowest color not assigned to any of its neighbors.”  In Haskell, this can be implemented as a fold over a list of vertices, where the accumulator variable is a map from vertices to colors that is initially empty:

-- Given a vertex list and adjacency list, return a graph coloring
-- i.e. mapping from vertices to colors where colors = [1..]
graphColor::(Ord a)=>[a]->Map a [a]->Map a Int
graphColor nodes neighbors =
  foldl assignColor Data.Map.empty nodes
    where assignColor colors node =
            case Data.Map.lookup node neighbors of
              Nothing -> Data.Map.insert node 1 colors
              Just ns -> let next_color = lowestSlot $ map ((flip Data.Map.lookup) colors) ns
                         in Data.Map.insert node next_color colors

lowestSlot::[Maybe Int]->Int
lowestSlot xs = foldl h 1 $ sort xs
  where h best Nothing = best
        h best (Just x) | best == x = best + 1
                        | otherwise = best

There are many ways to represent graphs, with several interesting implementations proposed for Haskell.  The greedy algorithm above requires that an adjacency list (a map from a node to its neighbors) be available for each node.  I’ve chosen to use the basic Data.Map type with nodes as keys and lists of neighbors as values.    Because I want to focus on the coloring algorithm, I will not discuss further the steps for creating such adjacency lists and instead refer readers to the full source code. The greedy algorithm is quite efficient – O(N) time complexity – but can yield colorings that are far from minimal, depending on the input order of the vertices.  So when given ORFs in the order in which they occur in the DNA sequence, how well does the greedy algorithm perform?

To answer that question, I’ll compare to a slower yet more exhaustive graph coloring algorithm that I’ll call the “extended greedy” algorithm, which can be stated as: “For every remaining uncolored vertex, if it has no neighbor assigned to the current color, assign it to that color.  Repeat for the next color.”  In Haskell it can be implemented like this:

-- Partition vertices into color groups; 'connected' is a function
-- that tells if an edge exists between two vertices
extendedGreedy _ [] = [[]]
extendedGreedy connected xs =
  let (first, rest) = foldl (\(first, rest) next ->
                              if (not $ any (connected next) first)
                                then ((next:first), rest)
                                else (first, (next:rest)))
                            ([],[]) xs
  in first:(extendedGreedy connected rest)

The extended greedy algorithm is more time-intensive (consider the case where no ORFs overlap.  Then each vertex is still compared with every other so time complexity is at least O(n^2).  It gets worse when there is overlap.)  But it also seems more thorough and more likely to find a minimal coloring.  So how do the two methods compare, both in terms of performance and in terms of finding a minimal coloring?  I ran both against a series of four benchmarks derived from the genome of E. coli strain e2348/69, an organism I work with in the laboratory: first, the plasmid pGFPE-Ler, a small piece of DNA containing just a few genes.  Second, a genomic region responsible for pathogenesis called the LEE comprising about 2% of the genome.  Third, the first 400,000 base pairs, or about 10% of the genome.  And finally, the entire e2348/69 genome.

Greedy algorithm Extended Greedy
Sequence base pairs ORF’s Run time, seconds Chromatic number Run time, seconds Chromatic number
pGFPE-Ler 6362 84 0.01 10 0.01 10
LEE 102512 1079 0.24 26 0.2 26
e2348/69; first 400k 400000 5170 0.85 39 0.9 39
e2348/69 5026578 61255 113 75 11 75

The run times demonstrate the linear time complexity of the greedy algorithm and the worse time complexity of the extended greedy algorithm, which becomes really apparent for the largest benchmark.  However the extended algorithm appears to be doing no better at finding minimal colorings.  It is known that there is an order of input nodes such that the linear greedy algorithm will find an optimal coloring.  Perhaps what these results reveal is that the ORFs sorted in the order in which they occur is such an ideal ordering, but I do not now know how to prove that.  Perhaps a topic for a later post…



How a changeset is like a vial of bacteria

In biology,Sample-naming algorithm,source code management on December 20, 2011 by gcbenison Tagged: , , , , , ,

What do source code management and keeping track of samples in a biochemistry laboratory have in common? Quite a bit, it turns out.

There are many types of samples in the lab – tubes of purified protein or DNA, liquid bacterial cultures, solid cultures, frozen cell lines, etc. Each sample is derived from one or more other samples (for example, DNA is purified from a cell culture) and can give rise to yet more samples. Yet because there is no way that a sample can somehow be a part of its own ancestry, the sample lineage relationships form an instance of a directed acyclic graph (DAG) – the same entity that can be used to describe the revision history of a body of code.

Managing such a collection requires a good way to assign names to new members as they are generated. Most importantly, the names must be unique, and should also be drawn from some well-defined domain (e.g. “six character alphanumeric string” or “40 hex digit string”). Git famously and brilliantly names each revision with a sha1 digest calculated from the revision itself. Having names calculated from content in this way eliminates the need for synchronization in allocating names. My scheme for naming samples in the lab is more traditional in that it does rely on a sort of locking: each investigator has a pre-printed sheet of unique labels for each day, and takes one from this one sheet every time a new sample is generated. Because there is only one sheet per investigator per day, and each sheet is different, there are no name collisions.

Just as git can reconstruct the revision history of any file by walking through the DAG, I can reconstruct the history of any sample in the lab. Let’s test the efficiency of the process by looking at a typical sample:

A sample tube, displaying unique name

The sample shown above is a mixture of two purified proteins (it is common for labels to include both ad-hoc descriptive language as a sort of mnemonic together with the unique code to facilitate looking up the exact history). We begin by looking up the parents of the unique code G280909E found on the sample (an O(1) operation of finding the sheet this label was taken from, in a sorted binder, and reading off the parent labels). Then we repeat the process for the sample’s parents, etc, and finally end up with this sample’s lineage:

The two samples farthest back in the ancestry are freezer stocks of bacterial strains, one corresponding to each protein found in the final sample; this is the farthest back this sample can be traced. The intermediate nodes correspond to cell culture and purification steps. Using these labels, it is possible to refer back to any notes or data and unambiguously associate it with this physical sample. So how efficient is this process? Reconstructing the above tree took nine minutes, start to finish (and I should note that this interval includes an interruption caused by a minor lab accident!) Certainly not as fast as reconstructing a file history using git, but still not too bad considering that this is one sample out of hundreds in circulation at any time.


Contigs, in Haskell.

In biology,haskell on October 27, 2011 by gcbenison Tagged: ,

A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.

So imagine receiving a message that had been cut up into little fragments:

man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a m


Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).

Here is a ‘contig’ data type defined in Haskell:

import Data.List

data Contig = Contig [(String, Int)]

indent 0 = ""
indent x = " " ++ (indent $ x - 1)

instance Show Contig where
  show (Contig xs) = unlines $ "" : (map indentContig $
                     sortBy (\a b -> compare (snd a) (snd b)) xs)
    where indentContig (str, x) = (indent x) ++ str

*Main> Contig [("lo world", 3),("hello", 0)]

   lo world

Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:

*Main> assemble fragments

 and a good man. He was...he was one of us. H
             an. He was...he was one of us. He was a man w
                    was...he was one of us. He was a man who love
                                  ne of us. He was a man who loved the outdoors
                                        us. He was a man who loved the outdoors, and
                                             e was a man who loved the outdoors, and bowli
                                                     man who loved the outdoors, and bowling...

The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…

interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys

firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
  where candidate = op x
        proceed (Just y) = Just y
        proceed Nothing = firstThat op xs

-- True if either string is a prefix of the other.
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False

matchWithOffset offset str target
      | offset < (- (length str)) = False
      | offset  (length target) = False
      | True = isPrefix str $ drop offset target

matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
  find (\offset -> matchWithOffset offset str target) $
  interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
  where op Nothing = Nothing
        op (Just x) = Just (x + target_offset)
        offset = matchWithAnyOffset str target
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
  where op Nothing = Nothing
        op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
        op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))

assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
           where helper1 str (Just c) = helper2 (appendContig c str) c
                 helper2 Nothing c = (Just c)
                 helper2 x _ = x