Archive for the ‘functional programming’ Category


DNA sequence annotation: a graph coloring problem

In algorithms, biology, functional programming, haskell, science on June 18, 2012 by gcbenison

In this post I will explore methods for annotating DNA sequences with translated ORFs, like so:

[Figure: a DNA sequence, positions 1870 to 1920, annotated with its translated ORFs]

Creating such annotated diagrams is an application of graph coloring.  My code to produce them is free software.


DNA is a lot like RAM, except instead of bits coming together in groups of eight to form bytes, DNA bases (of which there are four – A, C, T, and G) come together in groups of three to form codons.  The sequence of codons in turn determines the sequence of amino acids that make up a protein molecule.  The function that maps codons to amino acids (which is notably neither one-to-one nor onto) is called the genetic code, and its deciphering was a major intellectual achievement of the twentieth century.

The analogy between bits and DNA bases breaks down in one important way: the notion of alignment.  The one thing that all computer vendors seem to agree on is that a byte consists of eight bits, that each bit clearly “belongs to” exactly one byte, and that it has a defined place within that byte.  No such thing is true of DNA.  To the molecular machinery that reads it, DNA is just a stream of single bases.  The special sequence of three bases ‘A’, ‘T’, ‘G’ – the “start codon” – tells the DNA-reading machinery to start grouping succeeding triples of bases into codons.  The sequence of codons initiated by the start codon – and then terminated by a sequence known as a “stop codon” – is called an open reading frame (ORF).  Thus each base cannot be inherently the “least significant” or “most significant”, the way that a bit can.  Rather, the significance of a base within a codon is determined entirely by its context within the surrounding frame defined by start and stop codons.  The same base can even simultaneously form part of multiple codons, with a different meaning in each, if it occurs within overlapping reading frames:

[Figure: a DNA sequence, positions 1 to 50, annotated with two overlapping ORFs that start at positions 8 and 13]

In the example above, the ‘C’ base at position 16 occurs in two different contexts: as the final base in the ‘Cys’ codon within the ORF that starts at position 8, and as the first base in the ‘Pro’ codon within the ORF that starts at position 13.
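The grouping of bases into codons described above can be sketched in Haskell. This is only a minimal illustration and not the code behind the diagrams: it scans the forward strand only, lets every ATG open its own ORF, and uses the three stop codons of the standard genetic code. The names `codons`, `stopCodons`, and `orfs` are made up for this example.

```haskell
import Data.List (isPrefixOf, tails)

-- Split a base sequence into complete codons, dropping any leftover bases.
codons :: String -> [String]
codons (a:b:c:rest) = [a, b, c] : codons rest
codons _            = []

-- Stop codons of the standard genetic code.
stopCodons :: [String]
stopCodons = ["TAA", "TAG", "TGA"]

-- Forward-strand ORFs as (start, end) pairs, 1-based and inclusive, where
-- 'end' is the last base of the stop codon. Assumption for this sketch:
-- every ATG opens its own ORF, so nested in-frame starts share a stop codon.
orfs :: String -> [(Int, Int)]
orfs dna =
  [ (start, start + 3 * length body + 2)
  | (start, suffix) <- zip [1 ..] (tails dna)
  , "ATG" `isPrefixOf` suffix
  , let (body, rest) = break (`elem` stopCodons) (codons suffix)
  , not (null rest)  -- keep only ORFs that reach an in-frame stop codon
  ]
```

For the made-up sequence "ATGCATGTAACCTAG", `orfs` returns `[(1,15),(5,10)]`: the ORF starting at position 5 lies entirely inside the one starting at position 1, but in a different reading frame.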

ORF annotation as an application of graph coloring

A display such as the one in the example above, where a DNA sequence is displayed annotated with the ORFs it contains and their translations, is very useful in planning genetic engineering experiments because it allows you to edit a DNA sequence and visualize the effects this will have on the encoded proteins.  ORFs that overlap must be displayed on separate lines.  So a program that generates such displays must contain an algorithm to decide on which line to print each ORF.  This deceptively simple-looking task is a special case of the graph coloring problem, and another example of how abstract concepts from graph theory tend to show up in everyday contexts.  Let each ORF be a vertex in a graph which contains an edge between any two overlapping ORFs.  The problem of assigning ORFs to display lines is equivalent to the problem of assigning a color to each vertex such that no two connected vertices share the same color.

The graph coloring problem is also the key to register allocation, optimal bus scheduling, and many other things.  And it is hard to do well.  It is trivially easy to find a coloring of a graph – just assign a different color to each node.  (In the DNA application, just put each ORF on its own line.)  It is much harder to find a minimal coloring – a coloring using the fewest possible distinct colors, which in the DNA case corresponds to using the fewest possible display lines to contain all the ORFs.
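The construction of the graph from the ORFs could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code from the full source: an ORF is reduced to a hypothetical `(start, end)` pair of 1-based inclusive positions, and the names `ORF`, `overlaps`, and `adjacency` are invented here. Comparing every pair of ORFs is quadratic, which is fine for an example but not necessarily for a whole genome.

```haskell
import qualified Data.Map as Map

-- Hypothetical representation: first and last base, 1-based and inclusive.
type ORF = (Int, Int)

-- Two ORFs overlap when they share at least one base position.
overlaps :: ORF -> ORF -> Bool
overlaps (s1, e1) (s2, e2) = s1 <= e2 && s2 <= e1

-- Map every ORF to the list of other ORFs it overlaps (its neighbors in
-- the graph). ORFs without any overlap get an empty neighbor list.
adjacency :: [ORF] -> Map.Map ORF [ORF]
adjacency os =
  Map.fromList [ (o, [ p | p <- os, p /= o, overlaps o p ]) | o <- os ]
```

With the made-up input `[(8,25),(13,30),(40,60)]`, the first two ORFs become neighbors of each other and the third has no neighbors, so two display lines are enough.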

Comparing two graph coloring algorithms

Somewhat surprisingly, there is no known algorithm that solves the general minimal graph coloring problem efficiently (i.e. with polynomial time complexity); the problem is NP-hard.  There are many heuristics, though, that approximate the minimal solution.  For example, the greedy coloring algorithm can be stated: “for each vertex v_i in V, assign the lowest color not assigned to any of its neighbors.”  In Haskell, this can be implemented as a fold over a list of vertices, where the accumulator variable is a map from vertices to colors that is initially empty:

import           Data.List (foldl', sort)
import           Data.Map  (Map)
import qualified Data.Map  as Map

-- Given a vertex list and an adjacency list, return a graph coloring,
-- i.e. a mapping from vertices to colors, where colors = [1..]
graphColor :: Ord a => [a] -> Map a [a] -> Map a Int
graphColor nodes neighbors = foldl' assignColor Map.empty nodes
  where
    assignColor colors node =
      case Map.lookup node neighbors of
        -- A vertex without an adjacency entry has no neighbors
        Nothing -> Map.insert node 1 colors
        Just ns ->
          let nextColor = lowestSlot $ map (`Map.lookup` colors) ns
          in  Map.insert node nextColor colors

-- Lowest color (starting at 1) not taken by any neighbor; Nothing marks
-- a neighbor that has not been colored yet
lowestSlot :: [Maybe Int] -> Int
lowestSlot xs = foldl' h 1 $ sort xs
  where
    h best Nothing = best
    h best (Just x)
      | best == x = best + 1
      | otherwise = best

There are many ways to represent graphs, with several interesting implementations proposed for Haskell.  The greedy algorithm above requires that an adjacency list (a map from a node to its neighbors) be available for each node.  I’ve chosen to use the basic Data.Map type with nodes as keys and lists of neighbors as values.  Because I want to focus on the coloring algorithm, I will not discuss further the steps for creating such adjacency lists and instead refer readers to the full source code.

The greedy algorithm is quite efficient: it visits each vertex once and inspects only that vertex’s neighbors, so its running time grows roughly linearly with the size of the graph (ignoring the logarithmic cost of the Data.Map operations).  It can, however, yield colorings that are far from minimal, depending on the input order of the vertices.  So when given ORFs in the order in which they occur in the DNA sequence, how well does the greedy algorithm perform?
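The dependence on input order is easy to see on a tiny graph. The sketch below is a compact, self-contained restatement of the greedy rule (the names `greedy`, `path`, and `colorsUsed` are invented for the example) applied to a four-vertex path a–b–c–d, which needs only two colors.

```haskell
import qualified Data.Map as Map
import Data.Maybe (mapMaybe)

-- Greedy coloring: visit the vertices in the given order and give each
-- one the lowest color (starting at 1) not used by an already-colored neighbor.
greedy :: Ord a => Map.Map a [a] -> [a] -> Map.Map a Int
greedy adj = foldl step Map.empty
  where
    step colors v =
      let used = mapMaybe (`Map.lookup` colors) (Map.findWithDefault [] v adj)
          c    = head [ k | k <- [1 ..], k `notElem` used ]
      in  Map.insert v c colors

-- Adjacency list of the path a - b - c - d.
path :: Map.Map Char String
path = Map.fromList [('a', "b"), ('b', "ac"), ('c', "bd"), ('d', "c")]

-- Number of distinct colors in a coloring produced by 'greedy'
-- (the highest color, since colors are assigned from 1 upward).
colorsUsed :: Map.Map a Int -> Int
colorsUsed = maximum . Map.elems
```

Visiting the vertices as "abcd" gives two colors, while the order "adbc" colors both end points first and forces a third color onto c: `colorsUsed (greedy path "abcd")` is 2 and `colorsUsed (greedy path "adbc")` is 3.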

To answer that question, I’ll compare to a slower yet more exhaustive graph coloring algorithm that I’ll call the “extended greedy” algorithm, which can be stated as: “For every remaining uncolored vertex, if it has no neighbor assigned to the current color, assign it to that color.  Repeat for the next color.”  In Haskell it can be implemented like this:

-- Partition vertices into color groups; 'connected' is a function
-- that tells whether an edge exists between two vertices
extendedGreedy :: (a -> a -> Bool) -> [a] -> [[a]]
extendedGreedy _ [] = []  -- no vertices left, so no further color groups
extendedGreedy connected xs =
  let (first, rest) = foldl (\(group, deferred) next ->
                              if not (any (connected next) group)
                                then (next:group, deferred)
                                else (group, next:deferred))
                            ([],[]) xs
  in first : extendedGreedy connected rest

The extended greedy algorithm is more time-intensive: even when no ORFs overlap, each vertex is still compared with every other, so the time complexity is at least O(n^2), and it gets worse when there is overlap.  But it also seems more thorough and more likely to find a minimal coloring.  So how do the two methods compare, both in terms of performance and in terms of finding a minimal coloring?  I ran both against a series of four benchmarks derived from the genome of E. coli strain e2348/69, an organism I work with in the laboratory: first, the plasmid pGFPE-Ler, a small piece of DNA containing just a few genes; second, a genomic region responsible for pathogenesis called the LEE, comprising about 2% of the genome; third, the first 400,000 base pairs, or about 10% of the genome; and finally, the entire e2348/69 genome.

                                            Greedy algorithm             Extended greedy
Sequence                Base pairs    ORFs  Run time (s)  Chromatic no.  Run time (s)  Chromatic no.
pGFPE-Ler                     6362      84          0.01             10          0.01             10
LEE                         102512    1079          0.24             26           0.2             26
e2348/69; first 400k        400000    5170          0.85             39           0.9             39
e2348/69                   5026578   61255           113             75            11             75

The run times demonstrate the linear time complexity of the greedy algorithm and the worse time complexity of the extended greedy algorithm, which becomes really apparent for the largest benchmark.  However the extended algorithm appears to be doing no better at finding minimal colorings.  It is known that there is an order of input nodes such that the linear greedy algorithm will find an optimal coloring.  Perhaps what these results reveal is that the ORFs sorted in the order in which they occur is such an ideal ordering, but I do not now know how to prove that.  Perhaps a topic for a later post…
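One way to check whether a given coloring is minimal without an exhaustive search is to compute a lower bound: if k ORFs all cover the same base, they overlap one another pairwise and need k separate lines.  The sketch below computes the largest such k with a sweep over start and end positions.  It assumes, as an illustration only, that each ORF is a `(start, end)` pair of 1-based inclusive positions; `ORF` and `maxDepth` are names invented here.  For graphs built from intervals on a line, greedy coloring in order of start position is known to reach this bound, which would be consistent with the results in the table.

```haskell
import Data.List (sort)

-- Hypothetical representation: first and last base, 1-based and inclusive.
type ORF = (Int, Int)

-- Largest number of ORFs covering any single position. No assignment of
-- ORFs to display lines can use fewer lines than this.
maxDepth :: [ORF] -> Int
maxDepth os = maximum (scanl (+) 0 (map snd events))
  where
    -- +1 where an ORF begins, -1 just past its last base. Sorting puts a
    -- -1 before a +1 at the same position, so an ORF ending at base e and
    -- one starting at base e + 1 are not counted as overlapping.
    events = sort (concat [ [(s, 1), (e + 1, -1)] | (s, e) <- os ])
```

If the number of colors used by a coloring equals `maxDepth` of the same ORF list, that coloring is minimal.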



Deploying a Scheme web application: the Taubatron

In functional programming, Scheme, Web development on April 23, 2012 by gcbenison

This is the story of how I deployed the Taubatron – a web application written in Scheme.  I wrote about what it does earlier; this post is concerned with getting it running on the actual internet, as opposed to on my personal computer.

CGI vs server process

The most straightforward approach would have been to deploy the application as a CGI program.  I had deployed Scheme applications at my hosting provider that way before.  This is also the type of service offered, at the time of this writing, at the free Scheme web hosting site.  But for this application, performance with CGI was a problem:

Document Path:          /taubatron.cgi
Document Length:        248 bytes

Concurrency Level:      10
Time taken for tests:   23.234 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      29000 bytes
HTML transferred:       24800 bytes
Requests per second:    4.30 [#/sec] (mean)
Time per request:       2323.402 [ms] (mean)
Time per request:       232.340 [ms] (mean, across all concurrent requests)
Transfer rate:          1.22 [Kbytes/sec] received

Percentage of the requests served within a certain time (ms)
  50%   2321
  66%   2331
  75%   2333
  80%   2338
  90%   2354
  95%   2365
  98%   2380
  99%   2401
 100%   2401 (longest request)

The slowness – up to several seconds to complete a request – had two causes: first, the expense of forking a new guile process for every request, and second, the application’s lengthy initialization phase (building a graph out of a dictionary of words).  Web developers in other languages have found ways to avoid these costs – for Python there is WSGI, and of course there is mod_perl – couldn’t I do as well in Scheme?  I considered mod_lisp and FastCGI but frankly these seemed difficult and perhaps not possible on my host.  The approach that seemed most promising was to run the application as a long-living server process using the built-in HTTP server found in recent versions of the guile Scheme compiler.  Getting such a server application running was about as easy as setting up the CGI program and the performance boost was remarkable:

Concurrency Level:      10
Time taken for tests:   2.488 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      385000 bytes
Total POSTed:           170000
HTML transferred:       306000 bytes
Requests per second:    401.97 [#/sec] (mean)
Time per request:       24.878 [ms] (mean)
Time per request:       2.488 [ms] (mean, across all concurrent requests)
Transfer rate:          151.13 [Kbytes/sec] received
                        66.73 kb/s sent
                        217.86 kb/s total

That’s right – performing the initialization steps just once resulted in a nearly 100-fold increase in throughput (from 4.30 to 401.97 requests per second).  Serving requests directly from the built-in HTTP server like this probably represents a lower bound on the latency of this application.  But to use this approach at all, first I would have to get a recent-enough version of guile running on my host, which turned out to be non-trivial.

Installing guile on the host

Of course, guile did not come pre-installed on my host the way PHP did, and I had no root access, so I could not simply install it like I would on a personal machine.  However, I did have user-level shell access and the necessary tools to compile guile from source.  This is how I had gotten an older version of guile running to host some simpler applications.  But the high-performance web server is found only in a newer guile version, which failed to compile from source due to resource limits imposed by the hosting service.  I tried simply uploading a build of guile from a Debian package appropriate for the host’s architecture; this failed at run time with an error about a glibc version mismatch.  However, I noticed that the early parts of the build process that involved compiling and linking C code were working fine; the build wouldn’t fail until later, when guile was trying to compile parts of itself into bytecode (files with extension ‘.go’ in the build tree).  Figuring that these bytecode files might be architecture-dependent but should not depend on any specific glibc version, I tried simply copying into the build tree, from the Debian package, those bytecode files which were failing to build.  And it worked – I had a working guile-2.0 compiler installed on my host.

Configuring the front-end server

But of course I wasn’t finished – it’s not as though I could just bind the guile server to port 80 on the shared host and be done with it.  I needed a way to integrate it with the front-end server, Apache in the case of this host.  One way is to bind the guile process to some high-numbered port and use Apache’s RewriteRule to map requests to that port.  But in a shared hosting environment I couldn’t count on just being able to grab an available port.  I had a good discussion about this at Barcamp Portland and concluded that the best approach was to bind the guile process to a local unix socket, and then configure the front-end web server to forward requests to that socket.  Binding the guile-based HTTP server to a unix socket was no problem, but trying to figure out how to get Apache to cooperate in this seemingly simple task was frustrating.  I eventually tried asking the Internet, but apparently it either did not know or did not care.  In contrast, it is easy to find examples of this in nginx.  I soon had my application serving requests through a unix socket behind nginx with a latency of less than 3 msec per request – nearly as fast as the bare guile HTTP server.  (It entertains me that this benchmark, my first use of nginx for any purpose, was achieved on a laptop on Portland’s MAX blue line, heading East.)

The CGI trampoline

Still, I was not finished, because I didn’t have the option of using nginx on my host – I had to figure out a way to make their Apache installation work for me.  I gave up on finding an Apache configuration directive to do this and realized that there was an alternative that was also likely to be portable to just about any host, no matter which web server it was running or how it was configured: I could write a lightweight CGI process that would simply open up a connection to the socket, forward the request, and echo back the response.  I called this a “CGI trampoline”, implemented it, and after the fact found at least one other implementation of the same idea using the same name.  My first implementation was in Scheme, and I had my web application serving requests through a unix socket behind Apache with a latency of 39 msec – roughly an order of magnitude slower than the bare guile HTTP server, but still several times better than the whole application as a CGI process (about 232 msec per request in the benchmark above).  The performance hit was due of course to the cost of starting a new guile process for each request.  I rewrote the CGI trampoline in C and request latency dropped to 3.6 msec – pretty good compared to the lower bound of 2.4 msec achieved by serving requests directly from the application running as an HTTP server.
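The translation work a CGI trampoline has to do can be sketched as two pure functions; the socket I/O around them is left out.  This is an illustration in Haskell, not the Scheme or C implementation described above.  It assumes the standard CGI meta-variables (REQUEST_METHOD, PATH_INFO, QUERY_STRING, CONTENT_TYPE), assumes the back end should receive PATH_INFO as the request path, and uses String where a real trampoline would handle raw bytes.  The names `Env`, `buildRequest`, and `toCgiResponse` are invented for the example.

```haskell
import Data.Maybe (fromMaybe)

-- CGI meta-variables as name/value pairs, e.g. taken from the environment.
type Env = [(String, String)]

-- Turn the CGI environment plus the request body (read from stdin) into
-- an HTTP/1.0 request that can be written to the back-end socket.
buildRequest :: Env -> String -> String
buildRequest env body =
  concat
    [ method, " ", target, " HTTP/1.0\r\n"
    , maybe "" (\t -> "Content-Type: " ++ t ++ "\r\n") (nonEmpty "CONTENT_TYPE")
    , "Content-Length: ", show (length body), "\r\n"
    , "\r\n"
    , body
    ]
  where
    -- Treat unset and empty variables alike.
    nonEmpty name = case lookup name env of
      Just v | not (null v) -> Just v
      _                     -> Nothing
    method = fromMaybe "GET" (nonEmpty "REQUEST_METHOD")
    path   = fromMaybe "/" (nonEmpty "PATH_INFO")
    target = maybe path (\q -> path ++ "?" ++ q) (nonEmpty "QUERY_STRING")

-- Turn the back end's HTTP response into CGI output: the status line
-- "HTTP/1.0 200 OK" becomes the CGI header "Status: 200 OK"; everything
-- after it (headers, blank line, body) is passed through unchanged.
toCgiResponse :: String -> String
toCgiResponse response =
  case break (== ' ') response of
    (version, ' ' : rest) | take 5 version == "HTTP/" -> "Status: " ++ rest
    _ -> response
```

A complete trampoline would read the environment and stdin, write the result of `buildRequest` to the unix socket, read the reply until the back end closes the connection, and print `toCgiResponse` of that reply to stdout.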

And that’s how the Taubatron was deployed – try it out here!


You’re only using PHP because you don’t have Guile and (sxml simple)

In functional programming, Scheme, Web development on February 19, 2012 by gcbenison

OK, maybe you would still use PHP even if you did have a Scheme implementation and the (sxml simple) module.  But it is true that just about any web application written in PHP could also be written in Scheme or another Lisp dialect.   And yes you can get it hosted, more easily than you might think; I’ve done it (see this example, described at the end) with just the basic services at my hosting provider.  There are even complete Lisp web frameworks available.

So web developers do use Scheme, even if it is far from the most popular web language.  But web development in Lisp is not just possible: it is also a good idea, especially as your project evolves past “a page with some dynamic content” to “a full application that runs over the web”.  HTML can be understood in two ways:

  1. As plain text with a few formatting tags thrown in
  2. As a tree of nodes, where nodes have names and optionally, attributes and values

These two views lead to two different approaches to dynamic web programming:

  1. Treat the web page as a plain text document, with little embedded snippets of code that expand like macros when the page is displayed
  2. Represent the entire web page as a tree using a native data structure, which is then rendered as HTML at page display time

Both HTML and many web programming languages were designed with view (1) in mind because it can be very simple.  You don’t need to be a programmer, or to understand what the phrase “tree of nodes” means, in order to understand what this is supposed to do:

Today is: <?php echo date('Y-m-d'); ?> and it is a <i>good</i> day

But the problem with this approach is that it doesn’t scale well to more complicated uses.   Take, for instance, this production PHP snippet from the WordPress code base (pasted here as an image because of the difficulty of embedding actual literal PHP here):

That, despite being an example of well-written PHP, is not terribly intuitive even to a programmer.  I really dislike the intermingling of two distinct syntaxes (that of HTML and PHP).  It forces you to switch back and forth between the two languages when mentally parsing such code.  It leads to odd orderings of delimiters –  <?php { ?> etc.  It’s what leads to PHP being called things like “bloated, messy, and counter-intuitive”.

Such contortions do not appear in web applications that treat the whole page as a native data structure which the system then renders as HTML.   Since all of the actual HTML is generated by the renderer, the application developer need not even know any HTML syntax in order to program this way!  Scheme is well-suited to representing such a tree of nodes as a native data structure, and there is a standard convention for doing so known as SXML.  One of the things I like best about Scheme for web development is the clear equivalence of a live Scheme object, its serialized representation as SXML, and its serialized representation as HTML – all of which can be interconverted with a single command.
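The same idea, a page held as a native tree and turned into HTML only by a renderer, can be illustrated outside of Scheme as well.  The Haskell sketch below is an analogy and not SXML itself: the `Node` type and the functions `escape` and `render` are invented for the example, and details such as void elements (`<br>`, `<input>`) are ignored.

```haskell
-- A page as a tree of nodes: elements carry a name, attributes and
-- children; text nodes carry plain text.
data Node
  = Element String [(String, String)] [Node]
  | Text String

-- Escape the characters that have a special meaning in HTML.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Render a tree as HTML. All tags and all escaping are produced here, so
-- code that builds the tree never writes HTML syntax itself.
render :: Node -> String
render (Text s) = escape s
render (Element name attrs children) =
  "<" ++ name ++ concatMap attr attrs ++ ">"
    ++ concatMap render children
    ++ "</" ++ name ++ ">"
  where
    attr (k, v) = " " ++ k ++ "=\"" ++ escape v ++ "\""
```

For instance, `render (Element "p" [] [Text "It is a ", Element "i" [] [Text "good"], Text " day"])` yields `<p>It is a <i>good</i> day</p>`.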

Interconversion of HTML and its equivalent Scheme representation

This means we can take the result of any Scheme computation, manipulate it into SXML, and render it as HTML.  A working implementation of the following example can be found here.  Consider a function top-word-usage which splits a string into words and returns the number of times each word appears, sorted with the most common words first:

> (top-word-usage "Hello, world.  Yes- hello, world, hello!" 5)
$1 = (("hello" . 3) ("world" . 2) ("yes" . 1))
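The source of top-word-usage is not reproduced in this post.  For readers who want to experiment with the behavior shown above, here is a rough Haskell counterpart.  It rests on two assumptions read off the example: words are compared case-insensitively with punctuation ignored, and the second argument limits how many entries are returned.  The name `topWordUsage` is made up for this sketch.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (sortBy)
import qualified Data.Map as Map
import Data.Ord (Down (..), comparing)

-- Count how often each word occurs and return at most n (word, count)
-- pairs, most frequent first; ties come out in alphabetical order.
topWordUsage :: String -> Int -> [(String, Int)]
topWordUsage text n =
  take n . sortBy (comparing (Down . snd)) . Map.toList $
    Map.fromListWith (+) [ (w, 1) | w <- words (map normalize text) ]
  where
    -- Lower-case letters; turn everything else into a word separator.
    normalize c
      | isAlpha c = toLower c
      | otherwise = ' '
```

Applied to the string from the example with a limit of 5, this returns `[("hello",3),("world",2),("yes",1)]`.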

Now define a function to format each entry as a string, and another function to wrap any content as an SXML “list element”:

(define (format-usage usage)
  (format #f "~a -- ~a occurrences" (car usage) (cdr usage)))

(define (as-li content)
  `(li ,content))

> (map as-li (map format-usage $1))

$2 = ((li "hello -- 3 occurrences")
      (li "world -- 2 occurrences")
      (li "yes -- 1 occurrences"))

Around this core functionality, we can use SXML to build a template for a complete web application:

(display "Content-type: text/html\n\n")
(sxml->xml
 `(html
   (head (title "wc demo"))
   (body
    (h1 (a (@ (href ,(car (command-line)))) "Word count demo"))
    ,(if input
         `(p "Top words of at least " ,min-word-length " characters:"
             ,(map as-li
                   (map format-usage (top-word-usage input 5))))
         '(p "Enter some text below, to count word usage:"))
    (form (@ (method "post")
             (name "input-form")
             (action ,(car (command-line))))
          (textarea (@ (name "input-content")
                       (rows 20)
                       (cols 60))
                    ,(or input ""))
          (p "Minimum word length: "
             (input (@ (type "text")
                       (name "min-length"))))
          (input (@ (type "submit")
                    (name "input-submit")
                    (value "Determine word usage")))))))

The full application is less than 100 lines long and can be found here. Try pasting in a favorite piece of writing and seeing what happens – it’s actually kind of fun!