Articles

Contigs, in Haskell.

In biology, haskell on October 27, 2011 by gcbenison Tagged: ,

A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.

So imagine receiving a message that had been cut up into little fragments:

man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a m

etc.

Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).

Here is a ‘contig’ data type defined in Haskell:

import Data.List

data Contig = Contig [(String, Int)]

indent::Int->String
indent 0 = ""
indent x = " " ++ (indent $ x - 1)

instance Show Contig where
  show (Contig xs) = unlines $ "" : (map indentContig $
                     sortBy (\a b -> compare (snd a) (snd b)) xs)
    where indentContig (str, x) = (indent x) ++ str

*Main> Contig [("lo world", 3),("hello", 0)]

hello
   lo world

Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:

*Main> assemble fragments

 and a good man. He was...he was one of us. H
             an. He was...he was one of us. He was a man w
                    was...he was one of us. He was a man who love
                                  ne of us. He was a man who loved the outdoors
                                        us. He was a man who loved the outdoors, and
                                             e was a man who loved the outdoors, and bowli
                                                     man who loved the outdoors, and bowling...

The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…

interleave::[a]->[a]->[a]
interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys

firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
  where candidate = op x
        proceed (Just y) = Just y
        proceed Nothing = firstThat op xs

-- True if either string is a prefix of the other.
isPrefix::String->String->Bool
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False

matchWithOffset::Int->String->String->Bool
matchWithOffset offset str target
      | offset < (- (length str)) = False
      | offset  (length target) = False
      | True = isPrefix str $ drop offset target

matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
  find (\offset -> matchWithOffset offset str target) $
  interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
  
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
  where op Nothing = Nothing
        op (Just x) = Just (x + target_offset)
        offset = matchWithAnyOffset str target
          
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
  where op Nothing = Nothing
        op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
        op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))

assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
           where helper1 str (Just c) = helper2 (appendContig c str) c
                 helper2 Nothing c = (Just c)
                 helper2 x _ = x

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: