Archive for October, 2011

Articles

Contigs, in Haskell.

In biology,haskell on October 27, 2011 by gcbenison Tagged: ,

A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.

So imagine receiving a message that had been cut up into little fragments:

man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a m

etc.

Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).

Here is a ‘contig’ data type defined in Haskell:

import Data.List

data Contig = Contig [(String, Int)]

indent::Int->String
indent 0 = ""
indent x = " " ++ (indent $ x - 1)

instance Show Contig where
  show (Contig xs) = unlines $ "" : (map indentContig $
                     sortBy (\a b -> compare (snd a) (snd b)) xs)
    where indentContig (str, x) = (indent x) ++ str

*Main> Contig [("lo world", 3),("hello", 0)]

hello
   lo world

Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:

*Main> assemble fragments

 and a good man. He was...he was one of us. H
             an. He was...he was one of us. He was a man w
                    was...he was one of us. He was a man who love
                                  ne of us. He was a man who loved the outdoors
                                        us. He was a man who loved the outdoors, and
                                             e was a man who loved the outdoors, and bowli
                                                     man who loved the outdoors, and bowling...

The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…

interleave::[a]->[a]->[a]
interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys

firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
  where candidate = op x
        proceed (Just y) = Just y
        proceed Nothing = firstThat op xs

-- True if either string is a prefix of the other.
isPrefix::String->String->Bool
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False

matchWithOffset::Int->String->String->Bool
matchWithOffset offset str target
      | offset < (- (length str)) = False
      | offset  (length target) = False
      | True = isPrefix str $ drop offset target

matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
  find (\offset -> matchWithOffset offset str target) $
  interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
  
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
  where op Nothing = Nothing
        op (Just x) = Just (x + target_offset)
        offset = matchWithAnyOffset str target
          
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
  where op Nothing = Nothing
        op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
        op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))

assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
           where helper1 str (Just c) = helper2 (appendContig c str) c
                 helper2 Nothing c = (Just c)
                 helper2 x _ = x

Articles

Line-unwrap: in Perl

In Line wrap algorithm,Perl on October 10, 2011 by gcbenison Tagged:

As promised, in this post I will translate the line-unwrapping algorithm from Haskell to Perl.  (Why?  Not just as an academic exercise – there are many places where a Perl implementation might be wanted, such as in a CGI program.)

The problem was described here – plain text already containing line breaks, when pasted into a fixed-width container that also breaks lines, ends up with ugly jagged right edges.  What is needed is an algorithm to strip out newlines, but not all newlines: just those mid-paragraph, leaving paragraphs, lists, etc. unchanged.

The Haskell solution consisted of a function to decide whether each line should be merged with the next one, and then a “fold” operation to iterate over the input.  The perl version of the whether-to-merge function is a pretty straightforward translation of the Haskell version.  Iteration is a different matter: due to the lack of a “fold” function in the Perl language (but see note below), I relied on custom iteration using the list operator “<>”:

while (<>)
{
  my $this_line = $_;
  chomp $this_line;
  if ($sentinel)
  {
  if (($this_line eq "")
      || ($previous_line eq "")
      || ($this_line =~ /^[^A-Za-z0-9]/)
      || early_indent($previous_line, $this_line))
  { $result .= $previous_line . "\n"; }
  else {$result .= ($previous_line . " "); }
  }
  $previous_line = $this_line;
  $sentinel = 1;
}

Overall, the complete Perl solution is just as succinct as the Haskell one (both are 34 lines long) – however, I find the Haskell version more readable.  The use of custom iteration in the form of Perl’s <> operator with destructive updates is less readily discoverable than a standard “fold” operation.  Also, the strangeness with the “$sentinel” variable was necessary to prevent spurious newlines or spaces from showing up before the first line.  In contrast, the Haskell version resembles a straightforward statement of the problem.  And as is often the case, once the Haskell program compiled correctly, it also ran correctly, whereas with the Perl version I had to go through several edit/test/debug cycles to make it stop doing weird things.  Of course this reflects my relative skill level in the two languages, but it also reflects the advantages of pure functional programming.  I have known Perl a lot longer than I have known Haskell.

note: Since writing this Perl code, I’ve seen that it’s pretty straightforward to write or to find an implementation of ‘fold’ in Perl.  However, it’s definitely not the bread-and-butter of the language, and it’s worth noting that Perl implementations of ‘fold’ rely internally on destructive updates – whereas in Haskell, it’s not too hard to implement ‘fold’ from scratch in a purely functional way.