A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.
So imagine receiving a message that had been cut up into little fragments:
man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a metc.
Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).
Here is a ‘contig’ data type defined in Haskell:
import Data.List
data Contig = Contig [(String, Int)]
indent::Int->String
indent 0 = ""
indent x = " " ++ (indent $ x - 1)
instance Show Contig where
show (Contig xs) = unlines $ "" : (map indentContig $
sortBy (\a b -> compare (snd a) (snd b)) xs)
where indentContig (str, x) = (indent x) ++ str
*Main> Contig [("lo world", 3),("hello", 0)]
hello
lo world
Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:
*Main> assemble fragments
and a good man. He was...he was one of us. H
an. He was...he was one of us. He was a man w
was...he was one of us. He was a man who love
ne of us. He was a man who loved the outdoors
us. He was a man who loved the outdoors, and
e was a man who loved the outdoors, and bowli
man who loved the outdoors, and bowling...
The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…
interleave::[a]->[a]->[a]
interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys
firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
where candidate = op x
proceed (Just y) = Just y
proceed Nothing = firstThat op xs
-- True if either string is a prefix of the other.
isPrefix::String->String->Bool
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False
matchWithOffset::Int->String->String->Bool
matchWithOffset offset str target
| offset < (- (length str)) = False
| offset (length target) = False
| True = isPrefix str $ drop offset target
matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
find (\offset -> matchWithOffset offset str target) $
interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
where op Nothing = Nothing
op (Just x) = Just (x + target_offset)
offset = matchWithAnyOffset str target
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
where op Nothing = Nothing
op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))
assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
where helper1 str (Just c) = helper2 (appendContig c str) c
helper2 Nothing c = (Just c)
helper2 x _ = x

