Posts Tagged ‘haskell’

Articles

Contigs, in Haskell.

In biology,haskell on October 27, 2011 by gcbenison Tagged: ,

A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.

So imagine receiving a message that had been cut up into little fragments:

man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a m

etc.

Reassembling the original would be very difficult if just a single copy of the message had been cut up, leading to disjoint segments. But if multiple copies had been cut up, there would be fragments with overlapping portions – with enough matches, you could reassemble the original. And this is how whole genome sequences are currently solved. This leads to an important data structure for modern biology, the contig: a set of strings each associated with an offset (usually chosen to align all of the strings).

Here is a ‘contig’ data type defined in Haskell:

import Data.List

data Contig = Contig [(String, Int)]

indent::Int->String
indent 0 = ""
indent x = " " ++ (indent $ x - 1)

instance Show Contig where
  show (Contig xs) = unlines $ "" : (map indentContig $
                     sortBy (\a b -> compare (snd a) (snd b)) xs)
    where indentContig (str, x) = (indent x) ++ str

*Main> Contig [("lo world", 3),("hello", 0)]

hello
   lo world

Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:

*Main> assemble fragments

 and a good man. He was...he was one of us. H
             an. He was...he was one of us. He was a man w
                    was...he was one of us. He was a man who love
                                  ne of us. He was a man who loved the outdoors
                                        us. He was a man who loved the outdoors, and
                                             e was a man who loved the outdoors, and bowli
                                                     man who loved the outdoors, and bowling...

The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…

interleave::[a]->[a]->[a]
interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys

firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
  where candidate = op x
        proceed (Just y) = Just y
        proceed Nothing = firstThat op xs

-- True if either string is a prefix of the other.
isPrefix::String->String->Bool
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False

matchWithOffset::Int->String->String->Bool
matchWithOffset offset str target
      | offset < (- (length str)) = False
      | offset  (length target) = False
      | True = isPrefix str $ drop offset target

matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
  find (\offset -> matchWithOffset offset str target) $
  interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
  
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
  where op Nothing = Nothing
        op (Just x) = Just (x + target_offset)
        offset = matchWithAnyOffset str target
          
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
  where op Nothing = Nothing
        op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
        op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))

assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
           where helper1 str (Just c) = helper2 (appendContig c str) c
                 helper2 Nothing c = (Just c)
                 helper2 x _ = x

Articles

A program to intelligently remove carriage returns (so you can paste text without having it look awful)

In Line wrap algorithm on July 3, 2011 by gcbenison Tagged: , ,

If you just have text you want to clean up… go here.

If you are interested in how the program works… keep reading.

Here is a problem you have probably experienced at some time or another: you want to paste a plain text document somewhere such as an online form, only the window is narrower than the lines in your document and the result looks a bit like an E.E. Cummings poem: the carriage returns are now sprinkled more or less randomly throughout, creating a very ragged-looking right edge. I have often experienced this with those “paste your resume here” fields in online job applications.

To illustrate, I created the following short bio of Drew Endy, a biologist I recently saw speak at my workplace, by stitching together information from his Wikipedia entry and from Pubmed. The original is sixty characters wide:

plain text posted into a form too wide.

Plain text, posted into a form that is too wide.

That looks OK, except it would be better if the text expanded to fill the available horizontal space. Now let’s paste it into a form that is too narrow. The result is a mess:

Plain text pasted into a form that is too narrow

Plain text pasted into a form that is too narrow.

I will confess that more than once I have responded to this problem by manually editing out the carriage returns – hitting backspace, space, down arrow, etc. for every line until it looks right. There has to be a better way. Couldn’t we get the computer to take out the offending carriage returns for us? (At this point the impatient reader may jump to the complete solution or the result; otherwise keep reading for an explanation.)

To get started, let’s first analyze the problem. Clearly the program should not eliminate every carriage return – paragraphs, section headings, and enumerated lists should be left intact. Since the input is a plain text document lacking explicit markup, the program sometimes must guess at the user’s intention. There will be no one correct solution, but we can come close with a few simple rules:

For every line in the document, merge it with the next line unless:

  1. This line is blank.
  2. The following line is blank.
  3. The next line begins with a non-alphanumeric character.
  4. Appending the first word of the next line to this line would result in a line still shorter than this line.

The perhaps obscure-looking fourth rule is intended to catch ‘subsection-heading’ type lines, such as “Activities:” in the example. Next, let’s translate our pseudocode solution into actual code. Here it is in Haskell:

shouldMerge::[Char]->[Char]->Bool
shouldMerge "" _ = False
shouldMerge _ "" = False
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
shouldMerge line nextline | length (line ++ " " ++ (head . words) nextline) <
                              length nextline = False
shouldMerge _ _  = True

Each line of code is pretty much just a statement of one of the rules above – an illustration of how in Haskell, if you can clearly state the problem, you’re just about done writing your program!

To have a complete, working program though, we don’t just want a function that applies to any two lines; we want something that takes any text file as input, applies the rule as many times as needed, and produces a text file as output. Let’s build such a program around a function that takes as input a list of strings (the lines of the input file) and returns a (probably shorter) list of strings in which some of the input lines have been merged:

	main = interact (unlines . unWrapLines . lines)

	unWrapLines::[[Char]]->[[Char]]

Now how do we implement “unWarpLines” in terms of “shouldMerge”? Clearly we need to iterate over every line in the input, and it would be nice to do it with one of the standard iteration operators. To express it as a right-fold, we need to choose how we will use our accumulator variable, the result that is returned by each iteration and then passed to the next one. A natural choice might be to pass the resulting list of lines as it is built. However this won’t work because at each iteration we need to compare each line to its successor in the original input, not in the output. So instead of merging as we go during the fold, let’s just group lines that should be merged into lists, and then in a second sweep merge the grouped lines together:

unWrapLines = (map (stringJoin " ")) . innerUnWrap

innerUnWrap::[[Char]]->[[[Char]]]
innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest

The “stringJoin” function has type [Char]->[[Char]]->[Char]and simply joins lists of strings using a given delimiter. It is available with the Data.String.Utils module, or you can write one yourself in just a few lines. And that’s it — we’re done!See the complete Haskell program here. Let’s see how the example now looks, after running it through our de-line-wrapping filter and pasting it into the same two text boxes:



Much better!

I think this example demonstrates how Haskell is not just for abstract computer science, but is perfectly useful for messy real-world problems. But much as I like the Haskell solution, what I really wanted was an implementation of the de-carriage-return algorithm in Perl so I could use it in a CGI script (which I am not ready to migrate to Haskell, at least not yet.) So coming soon in a follow-up post: automatic carriage return cleansing, in Perl!

note: edited 10/9/11 to move complete code listing from WordPress to github.

note: edited 11/7/11 to spell “pseudocode” correctly (thanks to Ikem)

Follow

Get every new post delivered to your Inbox.