DNA sequence annotation: a graph coloring problem

In algorithms, biology, functional programming, haskell, science on June 18, 2012 by gcbenison

In this post I will explore methods for annotating DNA sequences with translated ORFs, like so:

[Figure: a DNA sequence display annotated with translated ORFs, positions 1870–1920]

Creating such annotated diagrams is an application of graph coloring.  My code to produce them is free software.


DNA is a lot like RAM, except instead of bits coming together in groups of eight to form bytes, DNA bases (of which there are four – A, C, T, and G) come together in groups of three to form codons.  The sequence of codons in turn determines the sequence of amino acids that make up a protein molecule.  The function that maps codons to amino acids (which is notably neither one-to-one nor onto) is called the genetic code, and its deciphering was a major intellectual achievement of the twentieth century.

The analogy between bits and DNA bases breaks down in one important way: the notion of alignment.  The one thing that all computer vendors seem to agree on is that a byte consists of eight bits, that each bit clearly “belongs to” exactly one byte, and that it has a defined place within that byte.  No such thing is true of DNA.  To the molecular machinery that reads it, DNA is just a stream of single bases.  The special sequence of three bases ‘A’, ‘T’, ‘G’ – the “start codon” – tells the DNA-reading machinery to start grouping succeeding triples of bases into codons.  The sequence of codons initiated by the start codon – and then terminated by a sequence known as a “stop codon” – is called an open reading frame (ORF).  Thus each base cannot be inherently the “least significant” or “most significant”, the way that a bit can.  Rather, the significance of a base within a codon is determined entirely by its context within the surrounding frame defined by start and stop codons.  The same base can even simultaneously form part of multiple codons, with a different meaning in each, if it occurs within overlapping reading frames:

[Figure: a DNA sequence with two overlapping reading frames, positions 10–50]

In the example above, the ‘C’ base at position 16 occurs in two different contexts: as the final base in the ‘Cys’ codon within the ORF that starts at position 8, and as the first base in the ‘Pro’ codon within the ORF that starts at position 13.

ORF annotation as an application of graph coloring

A display such as the one in the example above, where a DNA sequence is displayed annotated with the ORFs it contains and their translations, is very useful in planning genetic engineering experiments because it allows you to edit a DNA sequence and visualize the effects this will have on the encoded proteins.  ORFs that overlap must be displayed on separate lines.  So a program that generates such displays must contain an algorithm to decide on which line to print each ORF.  This deceptively simple-looking task is a special case of the graph coloring problem, and another example of how abstract concepts from graph theory tend to show up in everyday contexts.  Let each ORF be a vertex in a graph that contains an edge between any two overlapping ORFs.  The problem of assigning ORFs to display lines is equivalent to the problem of assigning a color to each node such that no two connected nodes share the same color.  The graph coloring problem is also the key to register allocation, optimal bus scheduling, and many other things.  And it is hard to do well.  It is trivially easy to find a coloring of a graph – just assign a different color to each node.  (In the DNA application, just put each ORF on its own line.)  It is much harder to find a minimal coloring – a coloring using the fewest possible distinct colors, which in the DNA case corresponds to using the fewest possible display lines to contain all the ORFs.

Comparing two graph coloring algorithms

Somewhat surprisingly, there is no known algorithm that solves the general minimal graph coloring problem efficiently (i.e. with polynomial time complexity).  There are, though, many heuristics that approximate the minimal solution.  For example, the greedy coloring algorithm can be stated: “for each vertex v_i in V, assign the lowest color not assigned to any of its neighbors.”  In Haskell, this can be implemented as a fold over a list of vertices, where the accumulator variable is a map from vertices to colors that is initially empty:

import Data.List (sort)
import Data.Map (Map)
import qualified Data.Map

-- Given a vertex list and adjacency list, return a graph coloring,
-- i.e. a mapping from vertices to colors where colors = [1..]
graphColor :: (Ord a) => [a] -> Map a [a] -> Map a Int
graphColor nodes neighbors =
  foldl assignColor Data.Map.empty nodes
    where assignColor colors node =
            case Data.Map.lookup node neighbors of
              Nothing -> Data.Map.insert node 1 colors
              Just ns -> let next_color = lowestSlot $ map ((flip Data.Map.lookup) colors) ns
                         in Data.Map.insert node next_color colors

-- Return the lowest color (counting from 1) not present in the input;
-- sorting places the Nothing values first, then Just values in ascending order
lowestSlot :: [Maybe Int] -> Int
lowestSlot xs = foldl h 1 $ sort xs
  where h best Nothing = best
        h best (Just x) | best == x = best + 1
                        | otherwise = best

There are many ways to represent graphs, with several interesting implementations proposed for Haskell.  The greedy algorithm above requires that an adjacency list (a map from a node to its neighbors) be available for each node.  I’ve chosen to use the basic Data.Map type with nodes as keys and lists of neighbors as values.  Because I want to focus on the coloring algorithm, I will not discuss further the steps for creating such adjacency lists and instead refer readers to the full source code.  The greedy algorithm is quite efficient – its running time is linear in the size of the graph (vertices plus edges) – but can yield colorings that are far from minimal, depending on the input order of the vertices.  So when given ORFs in the order in which they occur in the DNA sequence, how well does the greedy algorithm perform?
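(As an aside, building such an adjacency list is straightforward once each ORF is reduced to coordinates.  The sketch below is illustrative, not taken from the full source: the `Orf` type, the `overlaps` test, and the sample coordinates are my own assumptions.)

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- An ORF represented by its start and end coordinates (inclusive);
-- this representation is an assumption for illustration.
type Orf = (Int, Int)

-- Two intervals overlap when neither one ends before the other starts.
overlaps :: Orf -> Orf -> Bool
overlaps (s1, e1) (s2, e2) = s1 <= e2 && s2 <= e1

-- Build the adjacency list expected by graphColor: each ORF maps to
-- the list of other ORFs it overlaps.
adjacency :: [Orf] -> Map Orf [Orf]
adjacency orfs =
  Map.fromList [ (o, [o' | o' <- orfs, o' /= o, overlaps o o']) | o <- orfs ]
```

Feeding `adjacency [(8, 20), (13, 30), (40, 50)]` to graphColor puts the two overlapping ORFs on lines 1 and 2, and the isolated third ORF back on line 1.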

To answer that question, I’ll compare to a slower yet more exhaustive graph coloring algorithm that I’ll call the “extended greedy” algorithm, which can be stated as: “For every remaining uncolored vertex, if it has no neighbor assigned to the current color, assign it to that color.  Repeat for the next color.”  In Haskell it can be implemented like this:

-- Partition vertices into color groups; 'connected' is a function
-- that tells if an edge exists between two vertices
extendedGreedy :: (a -> a -> Bool) -> [a] -> [[a]]
extendedGreedy _ [] = []
extendedGreedy connected xs =
  let (first, rest) = foldl (\(first, rest) next ->
                              if (not $ any (connected next) first)
                                then ((next:first), rest)
                                else (first, (next:rest)))
                            ([],[]) xs
  in first:(extendedGreedy connected rest)
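As a quick sanity check of the idea, here is a compact, self-contained restatement of the same algorithm (the name `colorGroups` and the toy graph are mine; the empty vertex list yields no color groups), exercised on a path graph where two colors suffice:

```haskell
import Data.List (sort)

-- Peel off one maximal independent set (one color group) at a time;
-- 'connected' tells whether an edge exists between two vertices.
colorGroups :: (a -> a -> Bool) -> [a] -> [[a]]
colorGroups _ [] = []
colorGroups connected xs = grp : colorGroups connected rest
  where
    (grp, rest) = foldl place ([], []) xs
    place (g, r) v
      | any (connected v) g = (g, v : r)  -- conflicts with this color
      | otherwise           = (v : g, r)  -- safe to add to this color

-- Toy graph: integers are "connected" when they differ by one,
-- so 1-2-3-4 is a path.
touching :: Int -> Int -> Bool
touching a b = abs (a - b) == 1

demo :: [[Int]]
demo = colorGroups touching [1, 2, 3, 4]
```

Here `demo` partitions the four vertices into two groups, the chromatic number of a path.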

The extended greedy algorithm is more time-intensive.  (Consider the case where no ORFs overlap: each vertex is still compared with every other, so time complexity is at least O(n^2); it gets worse when there is overlap.)  But it also seems more thorough and more likely to find a minimal coloring.  So how do the two methods compare, both in terms of performance and in terms of finding a minimal coloring?  I ran both against a series of four benchmarks derived from the genome of E. coli strain e2348/69, an organism I work with in the laboratory: first, the plasmid pGFPE-Ler, a small piece of DNA containing just a few genes.  Second, a genomic region responsible for pathogenesis called the LEE, comprising about 2% of the genome.  Third, the first 400,000 base pairs, or about 10% of the genome.  And finally, the entire e2348/69 genome.

                                          Greedy algorithm             Extended greedy
Sequence              Base pairs   ORFs   Run time (s)  Chromatic no.  Run time (s)  Chromatic no.
pGFPE-Ler                   6362     84           0.01             10          0.01             10
LEE                       102512   1079           0.24             26          0.2              26
e2348/69, first 400k      400000   5170           0.85             39          0.9              39
e2348/69 (complete)      5026578  61255             11             75           113             75

The run times demonstrate the linear time complexity of the greedy algorithm and the worse time complexity of the extended greedy algorithm, which becomes really apparent for the largest benchmark.  However, the extended algorithm appears to be doing no better at finding minimal colorings.  It is known that for any graph there is some ordering of the input nodes for which the linear greedy algorithm will find an optimal coloring.  Perhaps what these results reveal is that ORFs, taken in the order in which they occur in the sequence, happen to form such an ideal ordering – but I do not yet know how to prove that.  Perhaps a topic for a later post…


The logic bomb that helped me embrace free scientific software

In Free software, science on May 15, 2012 by gcbenison

Scientists often write custom, domain-specific computer programs as part of their research.  But the source code to those programs – when it is released at all – is commonly released under restrictive licenses that prevent redistribution or modification.  Recent editorials in the two most prominent scientific journals arguing for greater sharing of scientific source code have drawn some long-overdue attention to this issue (Reviews of the editorials are freely available online, but both original articles – even while arguing for more openness in science – sit behind massive paywalls, an irony that has been picked up by other observers as well.  Open access is an important issue, but it is a topic for another post.)  The common practice of publishing scientific conclusions based on code that is not itself made public has long troubled me.  One incident in particular brought the issue into sharp focus in my mind.

A defining moment

In the summer of 2003 I spent two months in Puerto Rico to be with my wife, who was doing her dissertation research on shrimp that live in the streams there.  I was a graduate student too at the time, trying to finish my own degree in chemistry.  I knew that during the trip I would have a lot of “down time” holed up in our apartment, which I planned to spend analysing data I had accumulated over several years. Before heading for the tropics, I made sure I had all of my data with me, and the tools I needed to work with it.

One of those tools was a program for translating the raw Nuclear Magnetic Resonance (NMR) data generated by instruments into a more useful representation.  This program, which had been written by scientists working at the National Institutes of Health, was available gratis to researchers and widely used in my field.  I relied on this program, so before leaving for the tropics I made sure that my copy was good.  It therefore came as a great surprise several days into my visit when the program abruptly stopped working: it contained a logic bomb.  Past a certain date, it would refuse to start normally, instead attempting to send you an email telling you to download the next release.

The program’s authors would claim that the logic bomb was innocuous, that its only purpose was to reduce the burden of supporting multiple versions by keeping all users in sync.  They promised to always provide the next release for free, and they assumed that most users would be in a university setting with a high-speed internet connection, so that the forced upgrade would be a simple matter completed in a couple of minutes.  But in my case their assumptions turned out to be wrong.  The nearest internet connection was a half day’s drive away, slow, and unreliable.  I could not reasonably ask my wife to interrupt her field work schedule so that I could attempt an upgrade, which might not even succeed.  Besides, I had planned in advance to have everything I needed with me, and was annoyed at having my plans derailed by this logic bomb.  And despite having programming skills and a decent development environment available to me, I could not solve my problem simply by editing the program’s source, because I did not have it – the authors declined to distribute that.  That wasn’t some kind of oversight on their part: in a startup message displayed by the program, the authors made clear their intention to prevent any redistribution or modification by users, even to solve their own problems:

“This program and its related software is based  on  a  design  originating in the NIH Laboratory of Chemical Physics, NIDDK.  It is not to be distributed or modified.”

This was a radicalizing, formative moment for me, my equivalent of the encounter between the young Richard Stallman and a malfunctioning laser printer.  I was already a user of Linux and GNU and I had started to learn about the Free Software movement, but it took this incident to show me clearly how attempting to control the behavior of users – even in benign, relatively mild ways like forced upgrades – can have unintended consequences.  And I began to question whether the motivation behind the restrictions on this code really was so benign.  After all, the program’s authors were also my potential competitors.  They said they would always provide future releases freely, but nothing required that they keep this promise.  What if they one day decided to stop providing their upgrades?  What if they only provided them to certain people of their choosing?  The user community that had built up around this program was dependent on its original authors in a way that is unusual in science.  By banning me or anyone else from building on their work, and publishing the results, they were ensuring that any advances would come from them.  It’s normal in science to care deeply about getting proper credit for one’s work and to reserve the right to choose the time and place of publication.  But it is not normal, once you’ve published a method, to try to prevent others from reading about it, modifying it, and publishing their modifications.  Not normal, that is, unless the method happens to be in the form of a computer program, which is just a method of converting input to output.

Broader questions aside, I still was faced with the immediate problem of not being able to do my work.  Not ready to accept the ethical or legal validity of the restrictions that had been placed on me, I set out to solve my own problem however I could.  And as it so often turns out, these programmers’ desire to control their users’ behavior through technical means exceeded their technical ability to do so.

Disabling the logic bomb

I quickly figured out that I could get the program to run by setting the system clock back a month.  But this was inconvenient, led to incorrect time stamps, and caused all sorts of other problems.  So I set out to circumvent the logic bomb in the program itself.  I was a chemistry student, not a computer science student, but even at that stage in my career I at least knew enough about programming to surmise that the program might call “time()” somewhere and compare the result to a fixed value.  I fired up the program in a debugger, and saw this:

#0  0xb7ed10f5 in time () from /lib/
#1  0x0809a6c3 in pastDate2 ()
#2  0x0809a5b9 in pastDate ()
#3  0x0804c285 in main ()

OK, not a whole lot of subterfuge there.  What if I just made “pastDate()” return right away?  I did this:

 0809a5a4 :
  809a5a4:      55                      push   %ebp
  809a5a5:      89 e5                   mov    %esp,%ebp
- 809a5a7:      53                      push   %ebx
- 809a5a8:      83 ec 08                sub    $0x8,%esp
- 809a5ab:      68 d6 07 00 00          push   $0x7d6
+ 809a5a7:      c9                      leave  
+ 809a5a8:      c3                      ret    
+ 809a5a9:      ec                      in     (%dx),%al
+ 809a5aa:      08 68 d6                or     %ch,-0x2a(%eax)
+ 809a5ad:      07                      pop    %es
+ 809a5ae:      00 00                   add    %al,(%eax)
  809a5b0:      6a 01                   push   $0x1
  809a5b2:      6a 08                   push   $0x8
  809a5b4:      e8 d3 00 00 00          call   809a68c

And it worked – it took a change of two bytes to disable the logic bomb.  Incidentally, I continued using the program for several more years and never did upgrade.  I never saw the need to.

Why is scientific software so closed?

Every field of science today is dependent on computation, but research being what it is, there are often computational needs that can’t be met by any widely-available, well-known tool.  So scientists who know some programming tend to write these niche, domain-specific applications themselves.  But such scientists often aren’t very familiar with best practices in software development, and that includes licensing – many scientists, even those who program, have never heard of the GPL.

As a result, such niche scientific programs, when they are released at all, are often released under custom licenses that were written without much thought.  And these custom licenses, more often than not, are quite restrictive.  Why?  There are several reasons.  Some scientists, conscious of their limitations as programmers, don’t want their source code seen out of fear of embarrassment (reputation is everything in science).  Some are concerned about the potential extra work of supporting a user base if their programs are distributed.  Some scientists – or their institution’s intellectual property office – may be holding on to the hope of some day profiting from the code, though this hope is forlorn for most niche, domain-specific code that is only of interest to a handful of colleagues.  Some may actually want to make it harder for others to build on their work, and thus increase their own standing.  That kind of territorial attitude is out of place in science and can actually backfire.

I don’t know if the GPL is the answer to everything in scientific licensing, though I have used it for my own most significant scientific software.  I do know that it should be used more often than it is, and that scientific licensing in general right now tends to be too restrictive.  And I know that the scientific community isn’t thinking about licensing enough.  Even the editorials in Science and Nature are another illustration of this.  They call for distribution of source code that underlies published results.  But access to source code is only part of what software licensing is about.  What about derivative works?  What about warranties?  What about redistribution?

Science depends on the exchange of data, information, and ideas.  In a sense that’s what science is.  More and more, computers are what we use to do this.  So we need to be thinking about standards of behavior for how we use them.  “The code is mine, all mine” won’t cut it.


Deploying a Scheme web application: the Taubatron

In functional programming, Scheme, Web development on April 23, 2012 by gcbenison

This is the story of how I deployed the Taubatron – a web application written in Scheme.  I wrote about what it does earlier; this post is concerned with getting it running on the actual internet, as opposed to on my personal computer.

CGI vs server process

The most straightforward approach would have been to deploy the application as a CGI program.  I had deployed Scheme applications at my hosting provider that way before.  This is also the type of service offered, at the time of this writing, at the free Scheme web hosting site.  But for this application, performance with CGI was a problem:

Document Path:          /taubatron.cgi
Document Length:        248 bytes

Concurrency Level:      10
Time taken for tests:   23.234 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      29000 bytes
HTML transferred:       24800 bytes
Requests per second:    4.30 [#/sec] (mean)
Time per request:       2323.402 [ms] (mean)
Time per request:       232.340 [ms] (mean, across all concurrent requests)
Transfer rate:          1.22 [Kbytes/sec] received

Percentage of the requests served within a certain time (ms)
  50%   2321
  66%   2331
  75%   2333
  80%   2338
  90%   2354
  95%   2365
  98%   2380
  99%   2401
 100%   2401 (longest request)

The slowness – up to several seconds to complete a request – had two causes: first, the expense of forking a new guile process for every request, and second, the application’s lengthy initialization phase (building a graph out of a dictionary of words).  Web developers in other languages have found ways to avoid these costs – for Python there is WSGI, and of course there is mod_perl – couldn’t I do as well in Scheme?  I considered mod_lisp and FastCGI but frankly these seemed difficult and perhaps not possible on my host.  The approach that seemed most promising was to run the application as a long-living server process using the built-in HTTP server found in recent versions of the guile Scheme compiler.  Getting such a server application running was about as easy as setting up the CGI program and the performance boost was remarkable:

Concurrency Level:      10
Time taken for tests:   2.488 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      385000 bytes
Total POSTed:           170000
HTML transferred:       306000 bytes
Requests per second:    401.97 [#/sec] (mean)
Time per request:       24.878 [ms] (mean)
Time per request:       2.488 [ms] (mean, across all concurrent requests)
Transfer rate:          151.13 [Kbytes/sec] received
                        66.73 kb/s sent
                        217.86 kb/s total

That’s right – performing the initialization steps just once resulted in a 100-fold performance increase.  Serving requests directly from the built-in HTTP server like this probably represents a lower bound on the latency of this application.  But to use this approach at all, first I would have to get a recent-enough version of guile running on my host, which turned out to be non-trivial.

Installing guile on the host

Of course guile did not come pre-installed on my host the way PHP did, and I had no root access so I could not simply install it like I would on a personal machine; however, I did have user-level shell access and the necessary tools to compile guile from source.  This is how I had gotten an older version of guile running to host some simpler applications.  But the high-performance web server is found only in a newer guile version, which failed to compile from source due to resource limits imposed by the hosting service.  I tried simply uploading a build of guile from a Debian package appropriate for the host’s architecture; this failed at run time with an error about a glibc version mismatch.  However, I noticed that the early parts of the build process that involved compiling and linking C code were working fine; the build wouldn’t fail until later, when guile was trying to compile parts of itself into bytecode (files with extension ‘.go’ in the build tree).  Figuring that these bytecode files might be architecture-dependent but should not depend on any specific glibc version, I tried simply copying to the build tree from the Debian package those bytecode files which were failing to build.  And it worked – I had a working guile-2.0 compiler installed on my host.

Configuring the front-end server

But of course I wasn’t finished – it’s not as though I could just bind the guile server to port 80 on the shared host and be done with it.  I needed a way to integrate it with the front-end server, Apache in the case of this host.  One way is to bind the guile process to some high-numbered port and use Apache’s RewriteRule to map requests to that port.  But in a shared hosting environment I couldn’t count on just being able to grab an available port.  I had a good discussion about this at Barcamp Portland and concluded that the best approach was to bind the guile process to a local unix socket, and then configure the front-end web server to forward requests to that socket.  Binding the guile-based HTTP server to a unix socket was no problem, but trying to figure out how to get Apache to cooperate in this seemingly simple task was frustrating.  I eventually tried asking the Internet, but apparently it either did not know or did not care.  In contrast, it is easy to find examples of this in nginx.  I soon had my application serving requests through a unix socket behind nginx with a latency of less than 3 msec per request – nearly as fast as the bare guile HTTP server.  (It entertains me that this benchmark, my first use of nginx for any purpose, was achieved on a laptop on Portland’s MAX blue line, heading East.)
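For reference, the nginx side of this arrangement needs only a few lines.  This is a sketch with an illustrative port and socket path, not the exact configuration used:

```nginx
server {
    listen 8080;
    location / {
        # Hand every request to the guile HTTP server listening on a
        # local unix socket (the socket path here is illustrative)
        proxy_pass http://unix:/home/user/taubatron.sock:;
    }
}
```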

The CGI trampoline

Still, I was not finished, because I didn’t have the option of using nginx on my host – I had to figure out a way to make their Apache installation work for me.  I gave up on finding an Apache configuration directive to do this and realized that there was an alternative that was also likely to be portable to just about any host, no matter which web server it was running or how it was configured – I could write a lightweight CGI process that would simply open up a connection to the socket, forward the request, and echo back the response.  I called this a “CGI trampoline”, implemented it, and after the fact found at least one other implementation of the same idea using the same name.  My first implementation was in Scheme, and I had my web application serving requests through a unix socket behind Apache with a latency of 39 msec – ten times slower than the bare guile HTTP server, but still ten times better than the whole application as a CGI process.  The performance hit was due, of course, to the cost of starting a new guile process for each request.  I rewrote the CGI trampoline in C and request latency dropped to 3.6 msec – pretty good compared to the lower bound of 2.4 msec achieved by serving requests directly from the application running as an HTTP server.

And that’s how the Taubatron was deployed – try it out here!


The Nerdy Stuff Matters

In algorithms on March 28, 2012 by gcbenison

A childhood friend who grew up to be an artist has created many pieces based on letters and words – not just their meanings, but their visual form and their relationships to each other.  Here is one of them:


The word “yes” changes into the word “no” in five steps, where each step involves either changing, adding, or deleting exactly one letter, with the result of each step being a valid English word.  Upon seeing this I – being a computer programmer – immediately wondered if I could come up with an algorithm to find such sequences in a more general way.  That is, given a starting word and an ending word, could I design an algorithm to transform one word into the other by changing, adding, or deleting exactly one letter at a time, with the result being a valid English word at each step?

After some thought, I did come up with an approach and soon had a working implementation (fair warning: the link is just the source code for a program that does the job; you have to download and compile it to use it.  I’m working on an online version.)  Here is an example of its output:

  • table
  • able
  • ale
  • all
  • hall
  • hail
  • hair
  • chair

Though I had a working program I wanted to see how other programmers would approach the problem, so I posted it as a challenge on Stackoverflow, a site where programmers exchange questions and answers.  As it turns out, this question is a quite common programming challenge (although the rules about adding and deleting a character are most often left out – so that “cat hat hot” would be a valid sequence, but “cat chat that than tan” would not be.  In this form the sequences are called word ladders.)  Variants have appeared as an interview question.  Within minutes of posting, I received answers describing essentially the same approach I had used: “Represent the word relationships as a graph, and use a shortest path algorithm.  Duh.”

Word Ladders as Applied Graph Theory

For those readers not steeped in discrete math and computer science, some definitions are in order.  A graph is a collection of vertices – in this case, the words – and edges – in this case, the transformations relating the words to each other.  Such a graph formed from an English dictionary can be visualized as quite a large web spreading through space; here is a small part of it –

[Figure: portion of the English language word graph]

Once such a graph is defined, the problem of finding the desired sequence of words becomes the problem of finding the shortest possible path between two points in the graph.  This is not an easy problem, but it is a very standard one, so it becomes possible to make use of its very well-known solution.  As a programmer, you need not re-implement the solution or even understand it particularly well; you can just re-use an existing implementation.  I used the one provided by the excellent igraph library.  You simply feed it the graph, the start point, and the end point, and it returns the path.  So the only real work is to build the word graph from all the words in the language.
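The shortest-path search itself can be sketched as a plain breadth-first search.  The igraph library's implementation is far more efficient; the Haskell version below, with a toy three-word graph, is my own illustration of the idea rather than the program's actual igraph/C code:

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- Breadth-first search for a shortest path in an unweighted graph,
-- given as an adjacency list.  Paths in the queue are stored reversed
-- (most recent vertex first) and flipped on success.
shortestPath :: Ord a => Map a [a] -> a -> a -> Maybe [a]
shortestPath adj start goal = bfs [[start]] [start]
  where
    bfs [] _ = Nothing
    bfs (path : queue) visited =
      case path of
        [] -> bfs queue visited
        (v : _)
          | v == goal -> Just (reverse path)
          | otherwise ->
              let ns = [ n | n <- Map.findWithDefault [] v adj
                           , n `notElem` visited ]
              in bfs (queue ++ [ n : path | n <- ns ]) (visited ++ ns)

-- A tiny word graph: cat <-> hat <-> hot.
tinyGraph :: Map String [String]
tinyGraph = Map.fromList
  [ ("cat", ["hat"]), ("hat", ["cat", "hot"]), ("hot", ["hat"]) ]
```

With this graph, `shortestPath tinyGraph "cat" "hot"` recovers the ladder cat, hat, hot.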

Building the Word Graph

To build a graph out of all words in an English dictionary, you must provide a list of all pairs of words (cat <-> hat ; hat <-> hot; hot <-> shot) that should be connected.  The graph library doesn’t care what order you provide them in, as long as you provide each pair once.  Clearly the list will be large for the English language – there are over 200,000 words in the spelling dictionary on my computer, and there will be many more connections between these.  So how can you generate a list of all possible pairs?  How can you avoid finding the same pair repeatedly?  How do you know when you’re done?  How can you do this efficiently?

A naive algorithm would compare every word in the dictionary, in order, to every other word, in order.  But this is slow – the number of comparisons grows as the square of the size of the dictionary.  We can do much better by sorting the words into equivalence classes.  Consider words that differ only in their first letter.  We put these into an equivalence class by replacing the first character with a special place-holder.  For example, “walk” and “talk” both become “*alk”; they are in the same equivalence class and therefore should be connected.  Apply a similar procedure to the second character, the third character, etc… to build up a table like this:

walk *alk
walk w*lk
walk wa*k
walk wal*
tall *all
tall t*ll
tall ta*l
tall tal*
ear *ar
ear e*r
ear ea*
talk *alk
talk t*lk
talk ta*k
talk tal*

OK, so how does this speed things up?  It becomes clear when we sort the table by equivalence class:

walk *alk
talk *alk
tall *all
ear *ar
ear e*r
ear ea*
talk t*lk
tall t*ll
talk ta*k
tall ta*l
talk tal*
tall tal*
walk w*lk
walk wa*k
walk wal*

The sort operation has grouped together all words that should be connected; “walk -> talk” and “talk -> tall” in this example.  The slow step in this procedure is the sorting, which can be achieved by any of a number of standard algorithms in time proportional to n * log n – a big improvement over n * n for large dictionaries.
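The whole substitution-edge procedure – generate the wildcard keys, sort by key, pair up words within each equivalence class – can be sketched in a few lines of Haskell.  The function names here are mine, not from the original program:

```haskell
import Data.List (groupBy, sortOn, tails)

-- Replace each character position in turn with a placeholder to get a
-- word's equivalence-class keys: "walk" -> ["*alk","w*lk","wa*k","wal*"].
keysOf :: String -> [String]
keysOf w = [ take i w ++ "*" ++ drop (i + 1) w | i <- [0 .. length w - 1] ]

-- Tag every word with its keys, sort by key (the sort step from the
-- text), and connect the words that share a key.  Two words differing
-- by one substitution share exactly one key, so each pair appears once.
substitutionPairs :: [String] -> [(String, String)]
substitutionPairs dict =
  [ (a, b)
  | grp <- groupBy (\x y -> fst x == fst y)
                   (sortOn fst [ (k, w) | w <- dict, k <- keysOf w ])
  , (a : bs) <- tails (map snd grp)
  , b <- bs ]
```

On the four-word dictionary from the tables above, this yields exactly the pairs (“walk”, “talk”) and (“talk”, “tall”).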

So what about the rule that connects words by adding or deleting one character?  This variant of the problem is harder to find online, but I did find an implementation in a comment on this blog similar to the one I used.  The key is to create a table, as before, but for each word add all possible rows formed by deleting one letter:

bring ring
bring bing
bring brng
bring brig
bring brin
rings ings
rings rngs
rings rigs
rings rins
rings ring
ear ar
ear er
ear ea

Then sort by the second column, and filter against the sorted dictionary to retain only those rows where the second column is an actual word:

ear ar
bring bing
bring brig
bring brin
bring brng
ear ea
ear er
rings ings
rings rigs
rings ring
bring ring
rings rins
rings rngs

So in this example, the three connections “rings <-> ring”, “rings <-> rigs”, and “bring <-> ring” were found.  As before, the sorting is the slow step with O(n log n) time complexity.  So how do you combine the connections due to letter substitution with the connections due to deletion?  This is called a graph union and is supported by any graph library.  An interesting effect happens if you include only the connections due to deletion and leave out the connections due to substitution: now the word must change length by one letter at each step!

  • hands
  • hand
  • and
  • land
  • lad
  • lead
  • led
  • fled
  • fed
  • feed
  • fee
  • feet
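A ladder like the one above can be found with an ordinary breadth-first search over the deletion/insertion graph. This is a hedged Python sketch (the word list and every name in it are mine, not from the post); it applies the deletion rule in both directions so the graph is undirected:

```python
from collections import deque

def neighbors(word, dictionary):
    """Dictionary words one deletion or one insertion away."""
    out = set()
    for i in range(len(word)):                     # deletions
        out.add(word[:i] + word[i + 1:])
    for i in range(len(word) + 1):                 # insertions
        for c in "abcdefghijklmnopqrstuvwxyz":
            out.add(word[:i] + c + word[i:])
    return out & dictionary

def ladder(start, goal, dictionary):
    """Shortest word ladder from start to goal, via breadth-first search."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1], dictionary) - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

words = {"hands", "hand", "and", "land", "lad", "lead",
         "led", "fled", "fed", "feed", "fee", "feet"}
print(ladder("hands", "feet", words))
```

With this tiny dictionary the search reproduces the hands-to-feet ladder shown above.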

Conclusion: the nerdy stuff matters

Every practicing programmer knows that it is possible – perhaps normal is the right word – to write practical, useful code without drawing on the nerdier parts of computer science. You can write a program that works just fine without ever analyzing it for time complexity or proving its correctness.  The simplest, most intuitive data structures are often best.  But hopefully what the word ladder  example illustrates is that often the best way to solve an everyday, practical problem is to translate it into a special case of a well-studied abstract problem, and then apply a standard solution to that problem.  If solving a word ladder does not seem very practical, applied graph theory shows up everywhere: in revision management, in the provenance of lab samples, in deciding the best layout for a communications network.  So keep studying the nerdy stuff – one day, you will find yourself facing a down-to-earth problem that is harder than it looks, and realize that you already have the answer.


Guest commentary: Thien-Thi Nguyen on guile-www

In Scheme, Web development on March 11, 2012 by gcbenison Tagged: , , , ,

When I needed to develop web applications for the first time, naturally I turned to Scheme, a language which has been central to my evolution as a programmer and which remains my go-to language for many things.  I am still an enthusiast for web development in Scheme, and I think it could become much more popular than it is, particularly if it were easier to find hosting and support.

All the libraries you need for Scheme web development already exist. One of the earliest was guile-www, which provides support for HTTP, cookies, CGI, and related things.  For this post I turned to the current maintainer of guile-www to ask what motivated him to become involved in Scheme web development.  It appears that, perhaps because Scheme remains something of a fringe language on the web, the ecosystem around it continues to evolve rapidly.  Projects fork, merge, and fork again; there isn’t any one “Scheme stack” that dominates.  Nevertheless, the basic reasons for using Scheme for web development – it’s simple, it’s concise, it’s elegant – remain the same.

What drew you to get involved in maintaining guile-www?

ttn: Back in 1998 I made various motorcycle trips, and published a “trip log” with photos to a website.  This used some custom Emacs Lisp to push the html and jpeg files to the server (you might find wup.el — web update — somewhere on the net).  I had just discovered Guile and was intrigued by its module system, certainly a step up from Emacs’ crowded flatland.

I stumbled onto Guile-WWW, which IIRC at that time was client-side and HTTP/1.0 only, and started fiddling with it.  Around 2000 or 2001, I somehow also finagled write privs to Guile proper (the umbrella project that also included the CVS sub-repo of Guile-WWW) and started landing patches there.  Unfortunately, due to a misunderstanding with the former Guile maintainer (Marius Vollmer), those privs got cancelled.

Since the code is GPL, i started maintaining the package privately, naming myself as the copyright holder for new code and making releases every so often.  The hope was (and continues to be, though ever more faintly, seeing how things have developed since) that after some time, my changes could be merged back to GNU Guile.

What advantages do you see Scheme having over other languages for web development?

ttn: Less typing and more fun: My Emacs inserts () instead of [] automatically, edits and motions are by sexps, Emacs boldfaces the current list, etc.

Do you have a favorite application that uses guile-www?

ttn: That would be Sizzweb, i suppose.  I use it to serve Debian ISOs and other stuff on the home network.  Client-side, not so much happening…

The development of guile has picked up in pace in the past few  years; do you see its niche growing and that of guile-www with it?

ttn: Yes.  I believe bundling (web …) modules in Guile proper was a strategic mistake that renders Guile-WWW independence even more important.  So “with it” might be better read as “parallel to it”.


You’re only using PHP because you don’t have Guile and (sxml simple)

In functional programming, Scheme, Web development on February 19, 2012 by gcbenison Tagged: , , ,

OK, maybe you would still use PHP even if you did have a Scheme implementation and the (sxml simple) module.  But it is true that just about any web application written in PHP could also be written in Scheme or another Lisp dialect.   And yes you can get it hosted, more easily than you might think; I’ve done it (see this example, described at the end) with just the basic services at my hosting provider.  There are even complete Lisp web frameworks available.

So web developers do use Scheme, even if it is far from the most popular web language.  But web development in Lisp is not just possible: it is also a good idea, especially as your project evolves past “a page with some dynamic content” to “a full application that runs over the web”.  HTML can be understood in two ways:

  1. As plain text with a few formatting tags thrown in
  2. As a tree of nodes, where nodes have names and optionally, attributes and values

These two views lead to two different approaches to dynamic web programming:

  1. Treat the web page as a plain text document, with little embedded snippets of code that expand like macros when the page is displayed
  2. Represent the entire web page as a tree using a native data structure, which is then rendered as HTML at page display time

Both HTML and many web programming languages were designed with view (1) in mind because it can be very simple.  You don’t need to be a programmer, or to understand what the phrase “tree of nodes” means, in order to understand what this is supposed to do:

Today is: <?php echo date('Y-m-d'); ?> and it is a <i>good</i> day

But the problem with this approach is that it doesn’t scale well to more complicated uses.   Take, for instance, this production PHP snippet from the WordPress code base (pasted here as an image because of the difficulty of embedding actual literal PHP here):

That, despite being an example of well-written PHP, is not terribly intuitive even to a programmer.  I really dislike the intermingling of two distinct syntaxes (that of HTML and PHP).  It forces you to switch back and forth between the two languages when mentally parsing such code.  It leads to odd orderings of delimiters – <?php { ?> etc.  It’s what leads to PHP being called things like “bloated, messy, and counter-intuitive”.

Such contortions do not appear in web applications that treat the whole page as a native data structure which the system then renders as HTML.   Since all of the actual HTML is generated by the renderer, the application developer need not even know any HTML syntax in order to program this way!  Scheme is well-suited to representing such a tree of nodes as a native data structure, and there is a standard convention for doing so known as SXML.  One of the things I like best about Scheme for web development is the clear equivalence of a live Scheme object, its serialized representation as SXML, and its serialized representation as HTML – all of which can be interconverted with a single command.

Interconversion of HTML and its equivalent Scheme representation
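View (2) is not specific to Scheme: any language with nested data structures can represent a page as a tree and render it at display time. As a hedged illustration (this is my own sketch, not the post's Scheme), here is the same idea with Python tuples standing in for SXML nodes:

```python
def render(node):
    """Render an SXML-like tree as HTML text.  A node is either a
    plain string, or a (tag, attrs-dict, *children) tuple."""
    if isinstance(node, str):
        return node
    tag, attrs, *children = node
    attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
    inner = "".join(render(c) for c in children)
    return f"<{tag}{attr_text}>{inner}</{tag}>"

page = ("p", {}, "Today is a ", ("i", {}, "good"), " day")
print(render(page))
# → <p>Today is a <i>good</i> day</p>
```

The application manipulates only the tree; the renderer is the single place where HTML syntax lives.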

This means we can take the result of any Scheme computation, manipulate it into SXML, and render it as HTML.  A working implementation of the following example can be found here.  Consider a function top-word-usage which splits a string into words and returns the number of times each word appears, sorted with the most common words first:

> (top-word-usage "Hello, world.  Yes- hello, world, hello!" 5)
$1 = (("hello" . 3) ("world" . 2) ("yes" . 1))

Now define a function to format each entry as a string, and another function to wrap any content as an SXML “list element”:

(define (format-usage usage)
  (format #f "~a -- ~a occurrences" (car usage) (cdr usage)))

(define (as-li content)
  `(li ,content))

> (map as-li (map format-usage $1))

$2 = ((li "hello -- 3 occurrences")
      (li "world -- 2 occurrences")
      (li "yes -- 1 occurrences"))

Around this core functionality, we can use SXML to build a template for a complete web application:

(display "Content-type: text/html\n\n")
(sxml->xml
 `(html
   (head (title "wc demo"))
   (body
    (h1 (a (@ (href ,(car (command-line)))) "Word count demo"))
    ,(if input
         `(p "Top words of at least " ,min-word-length " characters:"
             ,(map as-li
                   (map format-usage (top-word-usage input 5))))
         '(p "Enter some text below, to count word usage:"))
    (form (@ (method "post")
             (name "input-form")
             (action ,(car (command-line))))
          (textarea (@ (name "input-content")
                       (rows 20)
                       (cols 60))
                    ,(or input ""))
          (p "Minimum word length: "
             (input (@ (type "text")
                       (name "min-length"))))
          (input (@ (type "submit")
                    (name "input-submit")
                    (value "Determine word usage")))))))

The full application is less than 100 lines long and can be found here. Try pasting in a favorite piece of writing and seeing what happens – it’s actually kind of fun!


git bugfix branches: choose the root wisely

In source code management on January 17, 2012 by gcbenison Tagged: , ,

(The impatient may jump directly to the summary or a case study.)

You’re hacking along on your project, and you discover a bug (not in whatever feature you’re working on; somewhere else in the code). You’re using git for revision control, so branching and merging are very cheap, and it makes sense to create a bugfix branch. That way you need not worry about creating a pristine patch immediately – you can fearlessly pepper your code with assertions, extra tests, and confessions, and generally tear your code apart in whatever way helps hunt down the bug, without fear of disrupting development on the main branch. Then when you’re happy with the bugfix branch, you can merge it back into the main branch (usually called ‘master’ when using git), even if more development has happened (possibly by other developers) in the meantime.

repository dependency graph, before bugfix

Figure 1. Repository, before bugfix

But what if the project has more than one branch under active development? Consider the typical scenario in Figure 1, where in addition to the master branch, there is a stable release branch and a feature development branch. Development on the feature branch is proceeding independently of the master branch. The release branch has diverged a bit, too. (Why? Perhaps bugfixes no longer relevant to the main branch; compatibility patches; etc. It happens.)  Should the bugfix branch be merged into these other active branches, as well? Certainly, if the bug affects those branches.  But in the process of merging, we should aim to pull in only our bugfix changes – the bugfix should not force us to merge our branches in a more general way at this time.  Therefore a bugfix branch emanating from the current master branch (as in Figure 2) presents a problem, because when merged with any other branch, it will pull in all of the current master branch with it.  (Basing the bugfix on the tip of any other current branch raises the same concern.)

bugfix branch from head of master

Figure 2 – bugfix branch from current master

One solution is to use git cherry-pick to apply the bugfix selectively to the other branches.   This is less straightforward than simple git merge and has the disadvantage that git cherry-pick must rename each commit (since the resulting commits have new parents, they must have different names).  This makes it less clear that the same bugfix was applied to both branches.  Constraining bugfixes to single commits – possibly using git rebase to squash a branch down – can make git cherry-pick more straightforward; however, there is still the renaming problem, and I dislike the requirement of one commit per bug; it is reasonable to split up a complex bugfix into steps and to want to retain that information.

The solution I like best involves first finding the commit that introduced the bug, and then branching from there.  A command that is invaluable for this is git blame.  It is so useful, I would recommend learning it right after commit, checkout, and merge.  The resulting repository now looks like Figure 3, where we have found the source of the bug deep inside the development history.

This approach has several advantages.   In a git log of the bugfix branch, the patch now appears adjacent to the problem it attempts to solve, rather than randomly interspersed with unrelated commits.  More importantly, the bugfix branch can now be merged with any appropriate branch using a simple git merge without dragging in any unrelated commits.  The bugfix branch should be merged with any branch that contains the source of the bug (i.e. where we rooted the bugfix branch.)  An extremely useful command for this – which I learned here – is git branch --contains.  In our example, the bugfix branch should be merged with master and with the feature branch, but not with the stable release branch, which diverged from master before the bug was introduced.

After the merge (Figure 4), the head commits of the ‘master’ and ‘feature’ branches now have two parents – the previous heads of those branches, and the head of the bugfix branch.  Some may recoil at this type of complex ancestry in the master branch, preferring a linear commit history.  But why?  Handling this sort of complexity is what git is great at.

bugfix branch rooted at the source of the bug

Figure 3 - bugfix branch rooted at source of bug

bugfix branch, merged

Figure 4 - bugfix branch, merged

A brief summary of the approach:

  1. Discover bug.
  2. Identify commit that introduced bug (git blame is invaluable here.) Can use git tag to keep track of the commit, if desired.
  3. Create a bugfix branch rooted at the commit identified in step (2): git checkout -b my-bugfix-branch <source-of-bug>
  4. hack hack hack…bug fixed
  5. Identify branches affected by the bug: git branch --contains <source-of-bug>
  6. For each branch identified in (5), merge in the bugfix branch

This approach scales to more complicated repositories having many branches with which you may not be familiar. Steps 5 and 6 can be put into a script that automatically finds all the appropriate branches:

# GCB 1jan12
# Automatically merge $bugfix_branch into all branches containing $bug_root.

for b in `git branch --contains $bug_root | sed 's/\*//'`; do
  echo $b
  git checkout $b
  git merge $bugfix_branch
done

A case study

This real-world example comes from the development of burrow-owl, a package I maintain that draws and manipulates magnetic resonance spectra.

1) Discovery of the bug:

When the documentation-generating program ‘doxygen’ was missing on the system, running make in the burrow-owl doc/ directory would fail (exit with non-zero status), which is a bug because such problems should be diagnosed during the configuration stage.

2) Uncovering the source of the bug:

Poking around in the build scripts, we find this line:

$(DOXYGEN) Doxyfile

and this:

AC_PATH_PROG(DOXYGEN, doxygen, no)

When doxygen is installed, this works as expected.  When it is absent, the ‘DOXYGEN’ variable is set to ‘no’, which ‘make’ then tries to execute – since there is no program called ‘no’, make fails.

3) Creating a bugfix branch

We can find the commit responsible for the lines in question using git-blame:

$ git blame
8771b190 (Greg Benison 2008-01-09 11:52:51 -0800  16) AC_CHECK_PROGS(INDENT, indent, [# ** indent not available **])
8771b190 (Greg Benison 2008-01-09 11:52:51 -0800  17)
2a258920 (Greg Benison 2008-11-23 23:49:24 -0800  18) AC_PATH_PROG(DOXYGEN, doxygen, no)
2a258920 (Greg Benison 2008-11-23 23:49:24 -0800  19)
1288c02a (greg         2009-12-21 07:35:38 -0800  20) PKG_CHECK_MODULES(GLIB,      [glib-2.0 gobject-2.0])

It’s apparent that the line we want to change was introduced by commit 2a258, quite far back in the history of ‘master’.  Now we look at that commit in more detail:

$ git log 2a258 -n1

commit 2a2589204454fb5f340a47ba40859f5c3357560b
Author: Greg Benison <>
Date:   Sun Nov 23 23:49:24 2008 -0800

added doxygen config file and some inline comments

Earlier commits do not use ‘doxygen’ and do not have the bug, so we create a tag at this commit which will serve as the root of our bugfix branch:

$ git tag bug-doxygen-missing 2a258
$ git checkout -b bug-doxygen-missing tags/bug-doxygen-missing

4) Creating the patch

Now on our bugfix branch, we make our changes, adding an option to ‘configure’ to either disable doxygen or specify an alternate program, and, if disabled, to have ‘make’ skip building the documentation rather than fail.

$ git log heads/bug-doxygen-missing --oneline

b276d6e Check during configuration for missing doxygen
f7cf4d8 Updated README.branch to contain description of the problem.
2a25892 added doxygen config file and some inline comments

Observations: our bugfix branch consists of two commits, and the next parent is the commit that introduced the bug, rather than some other unrelated point in the commit history.

5) Merging the bugfix branch

The repo contains a ‘master’ branch, where development is occurring, and two release branches:

$ git branch
  1.4
  1.5
* master

So to which branches should we apply our bugfix branch?  Only to those that
actually contain the bug, i.e. that include the commit that began our bugfix
branch.  We can find these with  git branch --contains:

$ git branch --contains tags/bug-doxygen-missing
  1.5
* bug-doxygen-missing
  master

Apparently, the bug was introduced before release 1.5, but after release 1.4.  So it is
appropriate to merge the bugfix branch into master and 1.5 only.

On master, we can verify that our bugfix branch has not yet been merged, and then merge it:

$ git branch
* master

$ git branch --no-merged

$ git merge heads/bug-doxygen-missing

Similarly for branch ‘1.5’; then it would be safe, though not necessary, to delete the bug-doxygen-missing branch, because it has been merged everywhere it needs to be.


How a changeset is like a vial of bacteria

In biology, Sample-naming algorithm, source code management on December 20, 2011 by gcbenison Tagged: , , , , , ,

What do source code management and keeping track of samples in a biochemistry laboratory have in common? Quite a bit, it turns out.

There are many types of samples in the lab – tubes of purified protein or DNA, liquid bacterial cultures, solid cultures, frozen cell lines, etc. Each sample is derived from one or more other samples (for example, DNA is purified from a cell culture) and can give rise to yet more samples. Yet because there is no way that a sample can somehow be a part of its own ancestry, the sample lineage relationships form an instance of a directed acyclic graph (DAG) – the same entity that can be used to describe the revision history of a body of code.

Managing such a collection requires a good way to assign names to new members as they are generated. Most importantly, the names must be unique, and they should be drawn from some well-defined domain (e.g. “six character alphanumeric string” or “40 hex digit string”). Git famously and brilliantly names each revision with a sha1 digest calculated from the revision itself. Having names calculated from content in this way eliminates the need for synchronization in allocating names. My scheme for naming samples in the lab is more traditional in that it does rely on a sort of locking: each investigator has a pre-printed sheet of unique labels for each day, and takes one label from this sheet every time a new sample is generated. Because there is only one sheet per investigator per day, and each sheet is different, there are no name collisions.
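Git's content-derived naming fits in a few lines. This is a hedged Python sketch of how git names a blob — the `blob <size>\0` header prepended to the content is git's actual convention, while the function name is mine:

```python
import hashlib

def git_blob_name(content: bytes) -> str:
    """Name content the way git names a blob: the sha1 of a
    'blob <size>\\0' header followed by the content itself.
    Identical content always gets the same name, so no locking
    or coordination is needed to allocate names."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_name(b"hello\n"))
# same digest that `git hash-object` reports for a file containing "hello\n"
```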

Just as git can reconstruct the revision history of any file by walking through the DAG, I can reconstruct the history of any sample in the lab. Let’s test the efficiency of the process by looking at a typical sample:

A sample tube, displaying unique name

The sample shown above is a mixture of two purified proteins (it is common for labels to include both ad-hoc descriptive language as a sort of mnemonic together with the unique code to facilitate looking up the exact history). We begin by looking up the parents of the unique code G280909E found on the sample (an O(1) operation of finding the sheet this label was taken from, in a sorted binder, and reading off the parent labels). Then we repeat the process for the sample’s parents, etc, and finally end up with this sample’s lineage:

The two samples farthest back in the ancestry are freezer stocks of bacterial strains, one corresponding to each protein found in the final sample; this is the farthest back this sample can be traced. The intermediate nodes correspond to cell culture and purification steps. Using these labels, it is possible to refer back to any notes or data and unambiguously associate it with this physical sample. So how efficient is this process? Reconstructing the above tree took nine minutes, start to finish (and I should note that this interval includes an interruption caused by a minor lab accident!) Certainly not as fast as reconstructing a file history using git, but still not too bad considering that this is one sample out of hundreds in circulation at any time.


Contigs, in Haskell.

In biology, haskell on October 27, 2011 by gcbenison Tagged: ,

A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.

So imagine receiving a message that had been cut up into little fragments:

man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a m


Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).

Here is a ‘contig’ data type defined in Haskell:

import Data.List

data Contig = Contig [(String, Int)]

indent 0 = ""
indent x = " " ++ (indent $ x - 1)

instance Show Contig where
  show (Contig xs) = unlines $ "" : (map indentContig $
                     sortBy (\a b -> compare (snd a) (snd b)) xs)
    where indentContig (str, x) = (indent x) ++ str

*Main> Contig [("lo world", 3),("hello", 0)]

hello
   lo world

Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:

*Main> assemble fragments

 and a good man. He was...he was one of us. H
             an. He was...he was one of us. He was a man w
                    was...he was one of us. He was a man who love
                                  ne of us. He was a man who loved the outdoors
                                        us. He was a man who loved the outdoors, and
                                             e was a man who loved the outdoors, and bowli
                                                     man who loved the outdoors, and bowling...

The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…

interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys

firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
  where candidate = op x
        proceed (Just y) = Just y
        proceed Nothing = firstThat op xs

-- True if either string is a prefix of the other.
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False

matchWithOffset offset str target
      | offset < (- (length str)) = False
      | offset > (length target) = False
      | True = isPrefix str $ drop offset target

matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
  find (\offset -> matchWithOffset offset str target) $
  interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
  where op Nothing = Nothing
        op (Just x) = Just (x + target_offset)
        offset = matchWithAnyOffset str target
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
  where op Nothing = Nothing
        op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
        op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))

assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
           where helper1 str (Just c) = helper2 (appendContig c str) c
                 helper2 Nothing c = (Just c)
                 helper2 x _ = x


Line-unwrap: in Perl

In Line wrap algorithm, Perl on October 10, 2011 by gcbenison Tagged:

As promised, in this post I will translate the line-unwrapping algorithm from Haskell to Perl.  (Why?  Not just as an academic exercise – there are many places where a Perl implementation might be wanted, such as in a CGI program.)

The problem was described here – plain text already containing line breaks, when pasted into a fixed-width container that also breaks lines, ends up with ugly jagged right edges.  What is needed is an algorithm to strip out newlines, but not all newlines: just those mid-paragraph, leaving paragraphs, lists, etc. unchanged.

The Haskell solution consisted of a function to decide whether each line should be merged with the next one, and then a “fold” operation to iterate over the input.  The perl version of the whether-to-merge function is a pretty straightforward translation of the Haskell version.  Iteration is a different matter: due to the lack of a “fold” function in the Perl language (but see note below), I relied on custom iteration using the list operator “<>”:

while (<>)
{
  my $this_line = $_;
  chomp $this_line;
  if ($sentinel)
  {
    if (($this_line eq "")
        || ($previous_line eq "")
        || ($this_line =~ /^[^A-Za-z0-9]/)
        || early_indent($previous_line, $this_line))
    { $result .= $previous_line . "\n"; }
    else { $result .= ($previous_line . " "); }
  }
  $previous_line = $this_line;
  $sentinel = 1;
}

Overall, the complete Perl solution is just as succinct as the Haskell one (both are 34 lines long) – however, I find the Haskell version more readable.  The use of custom iteration in the form of Perl’s <> operator with destructive updates is less readily discoverable than a standard “fold” operation.  Also, the strangeness with the “$sentinel” variable was necessary to prevent spurious newlines or spaces from showing up before the first line.  In contrast, the Haskell version resembles a straightforward statement of the problem.  And as is often the case, once the Haskell program compiled correctly, it also ran correctly, whereas with the Perl version I had to go through several edit/test/debug cycles to make it stop doing weird things.  Of course this reflects my relative skill level in the two languages, but it also reflects the advantages of pure functional programming.  I have known Perl a lot longer than I have known Haskell.
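For comparison, the same iteration written as an explicit fold reads much like the Haskell version. This is a hedged Python sketch of the merge rule as I read it from the Perl above (the `early_indent` test is omitted for brevity, and all names here are my own):

```python
from functools import reduce
import re

def keep_break(previous, current):
    """True when the newline between the two lines should be kept:
    blank lines, and lines opening with a non-alphanumeric
    (lists, quotes, ...), mark real paragraph structure."""
    return (current == "" or previous == ""
            or re.match(r"[^A-Za-z0-9]", current) is not None)

def unwrap(text):
    """Fold over the lines, joining each line to the next with either
    a newline (structure kept) or a space (mid-paragraph wrap removed)."""
    def step(state, line):
        result, prev = state
        if prev is None:              # first line: nothing to emit yet
            return result, line
        sep = "\n" if keep_break(prev, line) else " "
        return result + prev + sep, line
    result, last = reduce(step, text.split("\n"), ("", None))
    return result + (last or "")

print(unwrap("This paragraph was\nwrapped by hand.\n\n- but this list\n- stays intact"))
```

Carrying the previous line in the fold state plays the role of Perl's `$previous_line`, and the `None` initial state replaces the `$sentinel` variable entirely.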

note: Since writing this Perl code, I’ve seen that it’s pretty straightforward to write or to find an implementation of ‘fold’ in Perl.  However, it’s definitely not the bread-and-butter of the language, and it’s worth noting that Perl implementations of ‘fold’ rely internally on destructive updates – whereas in Haskell, it’s not too hard to implement ‘fold’ from scratch in a purely functional way.