A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.
So imagine receiving a message that had been cut up into little fragments:
man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a metc.
Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).
Here is a ‘contig’ data type defined in Haskell:
import Data.List data Contig = Contig [(String, Int)] indent::Int->String indent 0 = "" indent x = " " ++ (indent $ x - 1) instance Show Contig where show (Contig xs) = unlines $ "" : (map indentContig $ sortBy (\a b -> compare (snd a) (snd b)) xs) where indentContig (str, x) = (indent x) ++ str
*Main> Contig [("lo world", 3),("hello", 0)] hello lo world
Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:
*Main> assemble fragments and a good man. He was...he was one of us. H an. He was...he was one of us. He was a man w was...he was one of us. He was a man who love ne of us. He was a man who loved the outdoors us. He was a man who loved the outdoors, and e was a man who loved the outdoors, and bowli man who loved the outdoors, and bowling...
The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…
interleave::[a]->[a]->[a] interleave (x:xs)(y:ys) = x:y:(interleave xs ys) interleave xs [] = xs interleave [] ys = ys firstThat::(a -> Maybe b) -> [a] -> Maybe b firstThat op [] = Nothing firstThat op (x:xs) = proceed candidate where candidate = op x proceed (Just y) = Just y proceed Nothing = firstThat op xs -- True if either string is a prefix of the other. isPrefix::String->String->Bool isPrefix [] _ = True isPrefix _ [] = True isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys isPrefix _ _ = False matchWithOffset::Int->String->String->Bool matchWithOffset offset str target | offset < (- (length str)) = False | offset (length target) = False | True = isPrefix str $ drop offset target matchWithAnyOffset::String->String->Maybe Int matchWithAnyOffset str target = find (\offset -> matchWithOffset offset str target) $ interleave [0..(length target - 1)] $ map negate [1..(length str - 1)] matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int matchWithAnyOffsetPlus str (target, target_offset) = op offset where op Nothing = Nothing op (Just x) = Just (x + target_offset) offset = matchWithAnyOffset str target appendContig::Contig->String->Maybe Contig appendContig (Contig []) str = Just $ Contig [(str, 0)] appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs where op Nothing = Nothing op (Just x) | x > 0 = Just $ Contig ((str, x):xs) op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs)) assemble::[String]->Maybe Contig assemble = foldr helper1 (Just $ Contig []) where helper1 str (Just c) = helper2 (appendContig c str) c helper2 Nothing c = (Just c) helper2 x _ = x