In this post I will explore methods for annotating DNA sequences with translated ORFs, like so:

1870 1880 1890 1900 1910 1920 TGCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCAGGAAGAACTGGCGCAAGCCA euLeuThrAsnValHisGlyLeuAsnLeuGluAsnTrpGlnGluGluLeuAlaGlnAlaL MetAlaLeuIleTrpLysThrGlyArgLysAsnTrpArgLysPro MetTyrMetAlaLeuIleTrpLysThrGlyArgLysAsnTrpArgLysPro

Creating such annotated diagrams is an application of graph coloring. My code to produce them is free software.

## Background

DNA is a lot like RAM, except instead of bits coming together in groups of eight to form bytes, DNA bases (of which there are four - A, C, T, and G) come together in groups of three to form codons. The sequence of codons in turn determines the sequence of amino acids that make up a protein molecule. The function that maps codons to amino acids (which is notably neither one-to-one nor onto) is called the genetic code, and its deciphering was a major intellectual achievement of the twentieth century.

The analogy between bits and DNA bases breaks down in one important way: the notion of alignment. The one thing that all computer vendors seem to agree on is that a byte consists of eight bits, that each bit clearly "belongs to" exactly one byte, and that it has a defined place within that byte. No such thing is true of DNA. To the molecular machinery that reads it, DNA is just a stream of single bases. The special sequence of three bases 'A', 'T', 'G' - the "start codon" - tells the DNA-reading machinery to start grouping succeeding triples of bases into codons. The sequence of codons initiated by the start codon - and then terminated by a sequence known as a "stop codon" - is called an open reading frame (ORF). Thus each base cannot be inherently the "least significant" or "most significant", the way that a bit can. Rather, the significance of a base within a codon is determined entirely by its context within the surrounding frame defined by start and stop codons. The same base can even simultaneously form part of multiple codons, with a different meaning in each, if it occurs within overlapping reading frames:

10 20 30 40 50 TTGGTAATGTTATGCCGAAGGCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT MetLeuCysArgArgProPhePhePhePhePhePhePhePhePhePhe MetProLysAlaLeuPhePhePhePhePhePhePhePhePhePhe

In the example above, the 'C' base at position 16 occurs in two different contexts: as the final base in the 'Cys' codon within the ORF that starts at position 8, and as the first base in the 'Pro' codon within the ORF that starts at position 13.

## ORF annotation as an application of graph coloring

A display such as the one in the example above, where a DNA sequence is displayed annotated with the ORFs it contains and their translations, is very useful in planning genetic engineering experiments becuase it allows you to edit a DNA sequence and visualize the effects this will have on the encoded proteins. ORFs that overlap must be displayed on separate lines. So a program that generates such displays must contain an algorithm to decide on which line to print each ORF. This deceptively simple-looking task is special case of the graph coloring problem, and another example of how abstract concepts from graph theory tend to show up in everyday contexts. Let each ORF be the vertex in a graph which contains an edge between any two overlapping ORF's. The problem of assigning ORFs to display lines is equivalent to the problem of assigning a color to each node such that no two connected nodes share the same color. The graph coloring problem is also the key to register allocation, optimal bus scheduling, and many other things. And it is hard to do well. It is trivially easy to find *a coloring* of a graph - just assign a different color to each node. (In the DNA application, just put each ORF on its own line). It is much harder to find a *minimal coloring* - a coloring using the fewest possible distinct colors, which in the DNA case corresponds to using the fewest possible display lines to contain all the ORFs.

## Comparing two graph coloring algorithms

Somewhat surprisingly, there is no algorithm to solve the general minimal graph coloring problem efficiently (i.e. with polynomial time complexity). There are many heuristics though that approximate the minimal solution. For example, the greedy coloring algorithm can be stated: "for each vertex v_i in V, assign the lowest color not assigned to any of its neighbors." In Haskell, this can be implemented as a fold over a list of vertices, where the accumulator variable is a map from vertices to colors that is initially empty:

-- Given a vertex list and adjacency list, return a graph coloring -- i.e. mapping from vertices to colors where colors = [1..] graphColor::(Ord a)=>[a]->Map a [a]->Map a Int graphColor nodes neighbors = foldl assignColor Data.Map.empty nodes where assignColor colors node = case Data.Map.lookup node neighbors of Nothing -> Data.Map.insert node 1 colors Just ns -> let next_color = lowestSlot $ map ((flip Data.Map.lookup) colors) ns in Data.Map.insert node next_color colors lowestSlot::[Maybe Int]->Int lowestSlot xs = foldl h 1 $ sort xs where h best Nothing = best h best (Just x) | best == x = best + 1 | otherwise = best

There are many ways to represent graphs, with several interesting implementations proposed for Haskell. The greedy algorithm above requires that an adjacency list (a map from a node to its neighbors) be available for each node. I've chosen to use the basic Data.Map type with nodes as keys and lists of neighbors as values. Because I want to focus on the coloring algorithm, I will not discuss further the steps for creating such adjacency lists and instead refer readers to the full source code. The greedy algorithm is quite efficient - O(N) time complexity - but can yield colorings that are far from minimal, depending on the input order of the vertices. So when given ORFs in the order in which they occur in the DNA sequence, how well does the greedy algorithm perform?

To answer that question, I'll compare to a slower yet more exhaustive graph coloring algorithm that I'll call the "extended greedy" algorithm, which can be stated as: "For every remaining uncolored vertex, if it has no neighbor assigned to the current color, assign it to that color. Repeat for the next color." In Haskell it can be implemented like this:

-- Partition vertices into color groups; 'connected' is a function -- that tells if an edge exists between two vertices extendedGreedy::(a->a->Bool)->[a]->[[a]] extendedGreedy _ [] = [[]] extendedGreedy connected xs = let (first, rest) = foldl (\(first, rest) next -> if (not $ any (connected next) first) then ((next:first), rest) else (first, (next:rest))) ([],[]) xs in first:(extendedGreedy connected rest)

The extended greedy algorithm is more time-intensive (consider the case where no ORFs overlap. Then each vertex is still compared with every other so time complexity is at least O(n^2). It gets worse when there is overlap.) But it also seems more thorough and more likely to find a minimal coloring. So how do the two methods compare, both in terms of performance and in terms of finding a minimal coloring? I ran both against a series of four benchmarks derived from the genome of *E. coli* strain e2348/69, an organism I work with in the laboratory: first, the plasmid pGFPE-Ler, a small piece of DNA containing just a few genes. Second, a genomic region responsible for pathogenesis called the LEE comprising about 2% of the genome. Third, the first 400,000 base pairs, or about 10% of the genome. And finally, the entire e2348/69 genome.

Greedy algorithm | Extended Greedy | |||||

Sequence | base pairs | ORF's | Run time, seconds | Chromatic number | Run time, seconds | Chromatic number |

pGFPE-Ler | 6362 | 84 | 0.01 | 10 | 0.01 | 10 |

LEE | 102512 | 1079 | 0.24 | 26 | 0.2 | 26 |

e2348/69; first 400k | 400000 | 5170 | 0.85 | 39 | 0.9 | 39 |

e2348/69 | 5026578 | 61255 | 113 | 75 | 11 | 75 |

The run times demonstrate the linear time complexity of the greedy algorithm and the worse time complexity of the extended greedy algorithm, which becomes really apparent for the largest benchmark. However the extended algorithm appears to be doing no better at finding minimal colorings. It is known that there is an order of input nodes such that the linear greedy algorithm will find an optimal coloring. Perhaps what these results reveal is that the ORFs sorted in the order in which they occur is such an ideal ordering, but I do not now know how to prove that. Perhaps a topic for a later post...