Archive for the ‘science’ Category


DNA sequence annotation: a graph coloring problem

In algorithms, biology, functional programming, haskell, science on June 18, 2012 by gcbenison

In this post I will explore methods for annotating DNA sequences with translated ORFs, like so:

[Figure: an annotated DNA sequence diagram, positions 1870–1920, with translated ORFs displayed alongside the sequence]
Creating such annotated diagrams is an application of graph coloring.  My code to produce them is free software.


DNA is a lot like RAM, except instead of bits coming together in groups of eight to form bytes, DNA bases (of which there are four – A, C, T, and G) come together in groups of three to form codons.  The sequence of codons in turn determines the sequence of amino acids that make up a protein molecule.  The function that maps codons to amino acids (which is notably neither one-to-one nor onto) is called the genetic code, and its deciphering was a major intellectual achievement of the twentieth century.
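As a toy illustration of the many-to-one nature of the genetic code, here is a small lookup table in Haskell (just a handful of the 64 codons, chosen for illustration):

```haskell
import qualified Data.Map as Map

-- A small excerpt of the standard genetic code; the full table
-- has 64 entries.
geneticCode :: Map.Map String String
geneticCode = Map.fromList
  [ ("ATG", "Met")                     -- also the start codon
  , ("TGT", "Cys"), ("TGC", "Cys")     -- two codons, one amino acid
  , ("CCT", "Pro"), ("CCA", "Pro")
  , ("TAA", "Stop"), ("TAG", "Stop"), ("TGA", "Stop")
  ]

translate :: String -> Maybe String
translate codon = Map.lookup codon geneticCode
```

Both `translate "TGT"` and `translate "TGC"` yield `Just "Cys"` – distinct codons mapping to the same amino acid, which is why the code is not one-to-one.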

The analogy between bits and DNA bases breaks down in one important way: the notion of alignment.  The one thing that all computer vendors seem to agree on is that a byte consists of eight bits, that each bit clearly “belongs to” exactly one byte, and that it has a defined place within that byte.  No such thing is true of DNA.  To the molecular machinery that reads it, DNA is just a stream of single bases.  The special sequence of three bases ‘A’, ‘T’, ‘G’ – the “start codon” – tells the DNA-reading machinery to start grouping succeeding triples of bases into codons.  The sequence of codons initiated by the start codon – and then terminated by a sequence known as a “stop codon” – is called an open reading frame (ORF).  Thus each base cannot be inherently the “least significant” or “most significant”, the way that a bit can.  Rather, the significance of a base within a codon is determined entirely by its context within the surrounding frame defined by start and stop codons.  The same base can even simultaneously form part of multiple codons, with a different meaning in each, if it occurs within overlapping reading frames:

[Figure: a DNA sequence, positions 1–50, with two overlapping ORFs beginning at positions 8 and 13]

In the example above, the ‘C’ base at position 16 occurs in two different contexts: as the final base in the ‘Cys’ codon within the ORF that starts at position 8, and as the first base in the ‘Pro’ codon within the ORF that starts at position 13.
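The frame logic can be sketched in a few lines of Haskell (a simplified model of my own: it scans a single strand only and ignores the biological subtleties of translation initiation):

```haskell
import Data.List (isPrefixOf, tails)

-- Split a sequence into codons (triples), dropping any partial codon.
codons :: String -> [String]
codons (a:b:c:rest) = [a,b,c] : codons rest
codons _            = []

-- 0-based positions of every start codon in the sequence.
startPositions :: String -> [Int]
startPositions dna =
  [ i | (i, t) <- zip [0..] (tails dna), "ATG" `isPrefixOf` t ]

-- The ORF beginning at a start position: codons up to the first stop.
orfAt :: String -> Int -> [String]
orfAt dna i =
  takeWhile (`notElem` ["TAA", "TAG", "TGA"]) (codons (drop i dna))
```

Two start positions whose difference is not a multiple of three define overlapping reading frames, as in the example above.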

ORF annotation as an application of graph coloring

A display such as the one in the example above, where a DNA sequence is shown annotated with the ORFs it contains and their translations, is very useful in planning genetic engineering experiments because it allows you to edit a DNA sequence and visualize the effects this will have on the encoded proteins.  ORFs that overlap must be displayed on separate lines.  So a program that generates such displays must contain an algorithm to decide on which line to print each ORF.  This deceptively simple-looking task is a special case of the graph coloring problem, and another example of how abstract concepts from graph theory tend to show up in everyday contexts.  Let each ORF be a vertex in a graph that contains an edge between any two overlapping ORFs.  The problem of assigning ORFs to display lines is then equivalent to the problem of assigning a color to each vertex such that no two connected vertices share the same color.  The graph coloring problem is also the key to register allocation, optimal bus scheduling, and many other things.  And it is hard to do well.  It is trivially easy to find a coloring of a graph – just assign a different color to each vertex (in the DNA application, just put each ORF on its own line).  It is much harder to find a minimal coloring – a coloring using the fewest possible distinct colors, which in the DNA case corresponds to using the fewest possible display lines to contain all the ORFs.
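The graph construction can be sketched like this (the interval representation and names are my own, not the data types used in the real program):

```haskell
import qualified Data.Map as Map

-- An ORF represented as a half-open interval of base positions.
type ORF = (Int, Int)   -- (start, end)

-- Two ORFs conflict when their intervals overlap.
overlaps :: ORF -> ORF -> Bool
overlaps (s1, e1) (s2, e2) = s1 < e2 && s2 < e1

-- Adjacency list: map each ORF to the ORFs it overlaps.
overlapGraph :: [ORF] -> Map.Map ORF [ORF]
overlapGraph orfs = Map.fromList
  [ (o, [o' | o' <- orfs, o' /= o, overlaps o o']) | o <- orfs ]
```

For the example above, the ORFs starting at positions 8 and 13 would share an edge and so must land on different display lines.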

Comparing two graph coloring algorithms

Somewhat surprisingly, no known algorithm solves the general minimal graph coloring problem efficiently (i.e. with polynomial time complexity).  There are, however, many heuristics that approximate the minimal solution.  For example, the greedy coloring algorithm can be stated: “for each vertex v_i in V, assign the lowest color not assigned to any of its neighbors.”  In Haskell, this can be implemented as a fold over a list of vertices, where the accumulator variable is a map from vertices to colors that is initially empty:

import Data.Map (Map)
import qualified Data.Map
import Data.List (sort)

-- Given a vertex list and adjacency list, return a graph coloring
-- i.e. mapping from vertices to colors where colors = [1..]
graphColor::(Ord a)=>[a]->Map a [a]->Map a Int
graphColor nodes neighbors =
  foldl assignColor Data.Map.empty nodes
    where assignColor colors node =
            case Data.Map.lookup node neighbors of
              Nothing -> Data.Map.insert node 1 colors
              Just ns -> let next_color = lowestSlot $ map ((flip Data.Map.lookup) colors) ns
                         in Data.Map.insert node next_color colors

-- Given the colors already assigned to a node's neighbors (Nothing =
-- neighbor not yet colored), find the lowest color not in use
lowestSlot::[Maybe Int]->Int
lowestSlot xs = foldl h 1 $ sort xs
  where h best Nothing = best
        h best (Just x) | best == x = best + 1
                        | otherwise = best

There are many ways to represent graphs, with several interesting implementations proposed for Haskell.  The greedy algorithm above requires that an adjacency list (a map from a node to its neighbors) be available for each node.  I’ve chosen to use the basic Data.Map type with nodes as keys and lists of neighbors as values.  Because I want to focus on the coloring algorithm, I will not discuss further the steps for creating such adjacency lists and instead refer readers to the full source code.  The greedy algorithm is quite efficient – O(n) time complexity – but can yield colorings that are far from minimal, depending on the input order of the vertices.  So when given ORFs in the order in which they occur in the DNA sequence, how well does the greedy algorithm perform?
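To see how input order matters, here is a self-contained toy greedy colorer (the same idea as graphColor above, restated so the snippet stands alone) run on a small “crown” graph, a classic bad case for greedy coloring:

```haskell
import qualified Data.Map as Map

-- Greedy coloring: assign each vertex the lowest color unused
-- by its already-colored neighbors.
greedy :: Ord a => Map.Map a [a] -> [a] -> Map.Map a Int
greedy nbrs = foldl step Map.empty
  where step colors v =
          let used = [ c | u <- Map.findWithDefault [] v nbrs
                         , Just c <- [Map.lookup u colors] ]
          in Map.insert v (head [c | c <- [1..], c `notElem` used]) colors

-- Crown graph: a_i connected to b_j exactly when i /= j.
crown :: Map.Map String [String]
crown = Map.fromList
  [ ("a1", ["b2","b3"]), ("a2", ["b1","b3"]), ("a3", ["b1","b2"])
  , ("b1", ["a2","a3"]), ("b2", ["a1","a3"]), ("b3", ["a1","a2"]) ]

-- Interleaved order wastes colors; grouped order finds the optimum.
worst, best :: Int
worst = maximum (Map.elems (greedy crown ["a1","b1","a2","b2","a3","b3"]))
best  = maximum (Map.elems (greedy crown ["a1","a2","a3","b1","b2","b3"]))
```

Here `worst` is 3 while `best` is 2: the same graph, the same algorithm, but a different vertex order changes the number of colors used.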

To answer that question, I’ll compare to a slower yet more exhaustive graph coloring algorithm that I’ll call the “extended greedy” algorithm, which can be stated as: “For every remaining uncolored vertex, if it has no neighbor assigned to the current color, assign it to that color.  Repeat for the next color.”  In Haskell it can be implemented like this:

-- Partition vertices into color groups; 'connected' is a function
-- that tells if an edge exists between two vertices
extendedGreedy::(a->a->Bool)->[a]->[[a]]
extendedGreedy _ [] = [[]]
extendedGreedy connected xs =
  let (first, rest) = foldl (\(first, rest) next ->
                              if (not $ any (connected next) first)
                                then ((next:first), rest)
                                else (first, (next:rest)))
                            ([],[]) xs
  in first:(extendedGreedy connected rest)
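A quick usage sketch (the interval overlap test and sample ORFs are invented for illustration; the definition is repeated so the snippet stands alone, and the trailing empty group produced by the base case is dropped):

```haskell
extendedGreedy :: (a -> a -> Bool) -> [a] -> [[a]]
extendedGreedy _ [] = [[]]
extendedGreedy connected xs =
  let (first, rest) = foldl (\(f, r) next ->
                               if not (any (connected next) f)
                                 then (next:f, r)
                                 else (f, next:r))
                            ([], []) xs
  in first : extendedGreedy connected rest

-- ORFs as (start, end) intervals; two ORFs conflict if they overlap.
overlapping :: (Int, Int) -> (Int, Int) -> Bool
overlapping (s1, e1) (s2, e2) = s1 < e2 && s2 < e1

-- (8,17) overlaps (13,22), but (30,40) overlaps neither,
-- so two display lines suffice.
groups :: [[(Int, Int)]]
groups = takeWhile (not . null)
                   (extendedGreedy overlapping [(8,17), (13,22), (30,40)])
```

The first color group collects `(8,17)` and `(30,40)`; the conflicting `(13,22)` is deferred to a second pass and lands in the second group.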

The extended greedy algorithm is more time-intensive: even when no ORFs overlap, each vertex is still compared with every other, so time complexity is at least O(n^2), and it gets worse when there is overlap.  But it also seems more thorough and more likely to find a minimal coloring.  So how do the two methods compare, both in terms of performance and in terms of finding a minimal coloring?  I ran both against a series of four benchmarks derived from the genome of E. coli strain e2348/69, an organism I work with in the laboratory: first, the plasmid pGFPE-Ler, a small piece of DNA containing just a few genes.  Second, a genomic region responsible for pathogenesis called the LEE, comprising about 2% of the genome.  Third, the first 400,000 base pairs, or about 10% of the genome.  And finally, the entire e2348/69 genome.

| Sequence | Base pairs | ORFs | Greedy: run time (s) | Greedy: chromatic number | Extended greedy: run time (s) | Extended greedy: chromatic number |
|----------|-----------:|-----:|---------------------:|-------------------------:|------------------------------:|----------------------------------:|
| pGFPE-Ler | 6362 | 84 | 0.01 | 10 | 0.01 | 10 |
| LEE | 102512 | 1079 | 0.24 | 26 | 0.2 | 26 |
| e2348/69, first 400k | 400000 | 5170 | 0.85 | 39 | 0.9 | 39 |
| e2348/69 (complete) | 5026578 | 61255 | 11 | 75 | 113 | 75 |

The run times demonstrate the linear time complexity of the greedy algorithm and the worse time complexity of the extended greedy algorithm, which becomes readily apparent for the largest benchmark.  However, the extended algorithm appears to do no better at finding minimal colorings.  It is known that for any graph there exists an ordering of the vertices for which the greedy algorithm finds an optimal coloring.  Perhaps what these results reveal is that ORFs, sorted in the order in which they occur, form such an ideal ordering – but I do not yet know how to prove that.  Perhaps a topic for a later post…



The logic bomb that helped me embrace free scientific software

In Free software, science on May 15, 2012 by gcbenison

Scientists often write custom, domain-specific computer programs as part of their research.  But the source code to those programs – when it is released at all – is commonly released under restrictive licenses that prevent redistribution or modification.  Recent editorials in the two most prominent scientific journals arguing for greater sharing of scientific source code have drawn some long-overdue attention to this issue (Reviews of the editorials are freely available online, but both original articles – even while arguing for more openness in science – sit behind massive paywalls, an irony that has been picked up by other observers as well.  Open access is an important issue, but it is a topic for another post.)  The common practice of publishing scientific conclusions based on code that is not itself made public has long troubled me.  One incident in particular brought the issue into sharp focus in my mind.

A defining moment

In the summer of 2003 I spent two months in Puerto Rico to be with my wife, who was doing her dissertation research on shrimp that live in the streams there.  I was a graduate student too at the time, trying to finish my own degree in chemistry.  I knew that during the trip I would have a lot of “down time” holed up in our apartment, which I planned to spend analysing data I had accumulated over several years. Before heading for the tropics, I made sure I had all of my data with me, and the tools I needed to work with it.

One of those tools was a program for translating the raw Nuclear Magnetic Resonance (NMR) data generated by instruments into a more useful representation.  This program, which had been written by scientists working at the National Institutes of Health, was available gratis to researchers and widely used in my field.  I relied on this program, so before leaving for the tropics I made sure that my copy was good.  It therefore came as a great surprise, several days into my visit, when the program abruptly stopped working: it contained a logic bomb.  Past a certain date, it would refuse to start normally, instead attempting to send you an email telling you to download the next release.

The program’s authors would claim that the logic bomb was innocuous, that its only purpose was to reduce the burden of supporting multiple versions by keeping all users in sync.  They promised to always provide the next release for free, and they assumed that most users would be in a university setting with a high-speed internet connection, so that the forced upgrade would be a simple matter completed in a couple of minutes.  But in my case their assumptions turned out to be wrong.  The nearest internet connection was a half day’s drive away, slow, and unreliable.  I could not reasonably ask my wife to interrupt her field work schedule so that I could attempt an upgrade, which might not even succeed.  Besides, I had planned in advance to have everything I needed with me, and was annoyed at having my plans derailed by this logic bomb.  And despite having programming skills and a decent development environment available to me, I could not solve my problem simply by editing the program’s source, because I did not have it – the authors declined to distribute that.  That wasn’t some kind of oversight on their part: in a startup message displayed by the program, the authors made clear their intention to prevent any redistribution or modification by users, even to solve their own problems:

“This program and its related software is based  on  a  design  originating in the NIH Laboratory of Chemical Physics, NIDDK.  It is not to be distributed or modified.”

This was a radicalizing, formative moment for me, my equivalent of the encounter between the young Richard Stallman and a malfunctioning laser printer.  I was already a user of Linux and GNU and I had started to learn about the Free Software movement, but it took this incident to show me clearly how attempting to control the behavior of users – even in benign, relatively mild ways like forced upgrades – can have unintended consequences.  And I began to question whether the motivation behind the restrictions on this code really were so benign.  After all, the program’s authors were also my potential competitors.  They said they would always provide future releases freely, but nothing required that they keep this promise. What if they one day decided to stop providing their upgrades?  What if they only provided it to certain people of their choosing?  The user community that had built up around this program was dependent on its original authors in a way that is unusual in science.  By banning me or anyone else from building on their work, and publishing the results, they were ensuring that any advances would come from them. It’s normal in science to care deeply about getting proper credit for one’s work and to reserve the right to choose the time and place of publication.  But it is not normal, once you’ve published a method, to try to prevent others from reading about it, modifying it, and publishing their modifications.  Not normal, that is, unless the method happens to be in the form of a computer program, which is just a method of converting input to output.

Broader questions aside, I still was faced with the immediate problem of not being able to do my work.  Not ready to accept the ethical or legal validity of the restrictions that had been placed on me, I set out to solve my own problem however I could.  And as it so often turns out, these programmers’ desire to control their users’ behavior through technical means exceeded their technical ability to do so.

Disabling the logic bomb

I quickly figured out that I could get the program to run by setting the system clock back a month.  But this was inconvenient, produced incorrect time stamps, and caused all sorts of other problems.  So I set out to circumvent the logic bomb in the program itself.  I was a chemistry student, not a computer science student, but even at that stage in my career I knew enough about programming to surmise that the program might call “time()” somewhere and compare the result to a fixed value.  I fired up the program in a debugger, and saw this:

#0  0xb7ed10f5 in time () from /lib/
#1  0x0809a6c3 in pastDate2 ()
#2  0x0809a5b9 in pastDate ()
#3  0x0804c285 in main ()

OK, not a whole lot of subterfuge there.  What if I just made “pastDate()” return right away?  I did this:

 0809a5a4 :
  809a5a4:      55                      push   %ebp
  809a5a5:      89 e5                   mov    %esp,%ebp
- 809a5a7:      53                      push   %ebx
- 809a5a8:      83 ec 08                sub    $0x8,%esp
- 809a5ab:      68 d6 07 00 00          push   $0x7d6
+ 809a5a7:      c9                      leave  
+ 809a5a8:      c3                      ret    
+ 809a5a9:      ec                      in     (%dx),%al
+ 809a5aa:      08 68 d6                or     %ch,-0x2a(%eax)
+ 809a5ad:      07                      pop    %es
+ 809a5ae:      00 00                   add    %al,(%eax)
  809a5b0:      6a 01                   push   $0x1
  809a5b2:      6a 08                   push   $0x8
  809a5b4:      e8 d3 00 00 00          call   809a68c

And it worked – it took a change of two bytes to disable the logic bomb.  Incidentally, I continued using the program for several more years and never did upgrade.  I never saw the need to.

Why is scientific software so closed?

Every field of science today is dependent on computation, but research being what it is, there are often computational needs that can’t be met by any widely-available, well-known tool.  So scientists who know some programming tend to write these niche, domain-specific applications themselves.  But they often aren’t very familiar with best practices in software development, and that includes licensing – many scientists, even those who program, have never heard of the GPL.

As a result, such niche scientific programs, when they are released at all, are often released under custom licenses that were written without much thought.  And these custom licenses, more often than not, are quite restrictive.  Why?  There are several reasons.  Some scientists, conscious of their limitations as programmers, don’t want their source code seen out of fear of embarrassment (reputation is everything in science).  Some are concerned about the potential extra work of supporting a user base if their programs are distributed.  Some scientists – or their institution’s intellectual property office – may be holding on to the hope of some day profiting from the code, though this hope is forlorn for most niche, domain-specific code that is only of interest to a handful of colleagues.  Some may actually want to make it harder for others to build on their work, and thus increase their own standing.  That kind of territorial attitude is out of place in science and can actually backfire.

I don’t know if the GPL is the answer to everything in scientific licensing, though I have used it for my own most significant scientific software.  I do know that it should be used more often than it is, and that scientific licensing in general right now tends to be too restrictive.  And I know that the scientific community isn’t thinking about licensing enough.  Even the editorials in Science and Nature are another illustration of this.  They call for distribution of source code that underlies published results.  But access to source code is only part of what software licensing is about.  What about derivative works?  What about warranties?  What about redistribution?

Science depends on the exchange of data, information, and ideas.  In a sense that’s what science is.  More and more, computers are what we use to do this.  So we need to be thinking about standards of behavior for how we use them.  “The code is mine, all mine” won’t cut it.