Archive for the ‘Line wrap algorithm’ Category

Articles

Line-unwrap: in Perl

In Line wrap algorithm, Perl on October 10, 2011 by gcbenison

As promised, in this post I will translate the line-unwrapping algorithm from Haskell to Perl.  (Why?  Not just as an academic exercise – there are many places where a Perl implementation might be wanted, such as in a CGI program.)

The problem was described here – plain text already containing line breaks, when pasted into a fixed-width container that also breaks lines, ends up with ugly jagged right edges.  What is needed is an algorithm to strip out newlines, but not all newlines: just those mid-paragraph, leaving paragraphs, lists, etc. unchanged.

The Haskell solution consisted of a function to decide whether each line should be merged with the next one, and then a “fold” operation to iterate over the input.  The Perl version of the whether-to-merge function is a pretty straightforward translation of the Haskell version.  Iteration is a different matter: due to the lack of a built-in “fold” function in the Perl language (but see note below), I relied on custom iteration using the line-input operator “<>”:

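# Declared here so the excerpt stands alone.
my ($result, $previous_line, $sentinel) = ("", "", 0);

while (<>)
{
  my $this_line = $_;
  chomp $this_line;
  # $sentinel is false only on the first line, which has no predecessor to emit.
  if ($sentinel)
  {
    # Keep the line break after $previous_line if any "don't merge" rule applies.
    if (($this_line eq "")
        || ($previous_line eq "")
        || ($this_line =~ /^[^A-Za-z0-9]/)
        || early_indent($previous_line, $this_line))
    { $result .= $previous_line . "\n"; }
    # Otherwise join it to the next line with a single space.
    else { $result .= $previous_line . " "; }
  }
  $previous_line = $this_line;
  $sentinel = 1;
}
# When the loop ends, the final line is still held in $previous_line and has
# yet to be appended to $result; that step lies outside this excerpt.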

Overall, the complete Perl solution is just as succinct as the Haskell one (both are 34 lines long) – however, I find the Haskell version more readable.  The use of custom iteration in the form of Perl’s <> operator with destructive updates is less readily discoverable than a standard “fold” operation.  Also, the strangeness with the “$sentinel” variable was necessary to prevent spurious newlines or spaces from showing up before the first line.  In contrast, the Haskell version resembles a straightforward statement of the problem.  And as is often the case, once the Haskell program compiled correctly, it also ran correctly, whereas with the Perl version I had to go through several edit/test/debug cycles to make it stop doing weird things.  Of course this partly reflects my relative skill level in the two languages, but since I have known Perl a lot longer than I have known Haskell, I think it also reflects the advantages of pure functional programming.

note: Since writing this Perl code, I’ve seen that it’s pretty straightforward to write or to find an implementation of ‘fold’ in Perl.  However, it’s definitely not the bread-and-butter of the language, and it’s worth noting that Perl implementations of ‘fold’ rely internally on destructive updates – whereas in Haskell, it’s not too hard to implement ‘fold’ from scratch in a purely functional way.
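As a rough illustration of that last point, here is a minimal sketch of a right fold written from scratch in Haskell, with no mutable state. The names myFoldr and joinWords are made up for this example (myFoldr avoids clashing with the Prelude’s own foldr); neither comes from the original program.

```haskell
-- A from-scratch right fold: purely recursive, no destructive updates.
-- (Hypothetical name; the Prelude already provides foldr.)
myFoldr :: (a -> b -> b) -> b -> [a] -> b
myFoldr _ acc []     = acc
myFoldr f acc (x:xs) = f x (myFoldr f acc xs)

-- Illustrative combining function: join two strings with a space,
-- unless the accumulated right-hand side is still empty.
joinWords :: String -> String -> String
joinWords w rest
  | null rest = w
  | otherwise = w ++ " " ++ rest

main :: IO ()
main = do
  print (myFoldr (+) 0 [1, 2, 3 :: Int])             -- prints 6
  putStrLn (myFoldr joinWords "" ["a", "b", "c"])    -- prints a b c
```

Each step builds a new value from the head of the list and the result for the tail, so nothing is ever overwritten.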

A program to intelligently remove carriage returns (so you can paste text without having it look awful)

In Line wrap algorithm on July 3, 2011 by gcbenison

If you just have text you want to clean up… go here.

If you are interested in how the program works… keep reading.

Here is a problem you have probably experienced at some time or another: you want to paste a plain text document somewhere such as an online form, only the window is narrower than the lines in your document and the result looks a bit like an E.E. Cummings poem: the carriage returns are now sprinkled more or less randomly throughout, creating a very ragged-looking right edge. I have often experienced this with those “paste your resume here” fields in online job applications.

To illustrate, I created the following short bio of Drew Endy, a biologist I recently saw speak at my workplace, by stitching together information from his Wikipedia entry and from Pubmed. The original is sixty characters wide:

Plain text, posted into a form that is too wide.

That looks OK, except it would be better if the text expanded to fill the available horizontal space. Now let’s paste it into a form that is too narrow. The result is a mess:

Plain text pasted into a form that is too narrow.

I will confess that more than once I have responded to this problem by manually editing out the carriage returns – hitting backspace, space, down arrow, etc. for every line until it looks right. There has to be a better way. Couldn’t we get the computer to take out the offending carriage returns for us? (At this point the impatient reader may jump to the complete solution or the result; otherwise keep reading for an explanation.)

To get started, let’s first analyze the problem. Clearly the program should not eliminate every carriage return – paragraphs, section headings, and enumerated lists should be left intact. Since the input is a plain text document lacking explicit markup, the program sometimes must guess at the user’s intention. There will be no one correct solution, but we can come close with a few simple rules:

For every line in the document, merge it with the next line unless:

  1. This line is blank.
  2. The following line is blank.
  3. The next line begins with a non-alphanumeric character.
  4. Appending the first word of the next line to this line would result in a line still shorter than the next line.

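The perhaps obscure-looking fourth rule is intended to catch ‘subsection-heading’ type lines, such as “Activities:” in the example: if this line would still be shorter than the next one even with the next line’s first word added, the break was probably deliberate rather than the result of wrapping. Next, let’s translate our pseudocode solution into actual code. Here it is in Haskell: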

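import Data.Char (isAlphaNum)

shouldMerge :: [Char] -> [Char] -> Bool
-- Rule 1: this line is blank.
shouldMerge "" _ = False
-- Rule 2: the following line is blank.
shouldMerge _ "" = False
-- Rule 3: the next line begins with a non-alphanumeric character.
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
-- Rule 4: this line plus the next line's first word is still shorter than the next line.
shouldMerge line nextline | length (line ++ " " ++ (head . words) nextline) <
                              length nextline = False
shouldMerge _ _  = True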

Each line of code is pretty much just a statement of one of the rules above – an illustration of how in Haskell, if you can clearly state the problem, you’re just about done writing your program!

To have a complete, working program though, we don’t just want a function that applies to any two lines; we want something that takes any text file as input, applies the rule as many times as needed, and produces a text file as output. Let’s build such a program around a function that takes as input a list of strings (the lines of the input file) and returns a (probably shorter) list of strings in which some of the input lines have been merged:

	main = interact (unlines . unWrapLines . lines)

	unWrapLines::[[Char]]->[[Char]]

Now how do we implement “unWrapLines” in terms of “shouldMerge”? Clearly we need to iterate over every line in the input, and it would be nice to do it with one of the standard iteration operators. To express it as a right-fold, we need to choose how we will use our accumulator variable, the result that is returned by each iteration and then passed to the next one. A natural choice might be to pass the resulting list of lines as it is built. However, this won’t work because at each iteration we need to compare each line to its successor in the original input, not in the output. So instead of merging as we go during the fold, let’s just group lines that should be merged into lists, and then in a second sweep merge the grouped lines together:

unWrapLines = (map (stringJoin " ")) . innerUnWrap

innerUnWrap::[[Char]]->[[[Char]]]
innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest
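To see the two sweeps in action, here is a self-contained sketch that assembles the pieces above and runs them on a small sample. Two things in it are assumptions rather than part of the original program: the sample text is invented for illustration, and the standard Data.List.intercalate stands in for the stringJoin helper.

```haskell
import Data.Char (isAlphaNum)
import Data.List (intercalate)

-- The four "don't merge" rules, as in the listing above.
shouldMerge :: String -> String -> Bool
shouldMerge "" _ = False
shouldMerge _ "" = False
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
shouldMerge line nextline
  | length (line ++ " " ++ (head . words) nextline) < length nextline = False
shouldMerge _ _ = True

-- First sweep: group lines that belong together, working from the right.
innerUnWrap :: [String] -> [[String]]
innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest

-- Second sweep: join each group with single spaces.
-- Assumption: intercalate is used here in place of stringJoin.
unWrapLines :: [String] -> [String]
unWrapLines = map (intercalate " ") . innerUnWrap

-- Made-up input: a wrapped paragraph, a blank line, a heading,
-- another wrapped paragraph, and a two-item list.
sample :: [String]
sample =
  [ "The quick brown fox jumps"
  , "over the lazy dog."
  , ""
  , "Notes:"
  , "the fox is quick and the"
  , "dog is lazy."
  , "- first item"
  , "- second item"
  ]

main :: IO ()
main = mapM_ putStrLn (unWrapLines sample)
-- Prints:
-- The quick brown fox jumps over the lazy dog.
--
-- Notes:
-- the fox is quick and the dog is lazy.
-- - first item
-- - second item
```

The intermediate grouping for this input is [[line 1, line 2], [""], ["Notes:"], [line 5, line 6], ["- first item"], ["- second item"]]. “Notes:” stays on its own because of the fourth rule: “Notes: the” is 10 characters, shorter than the 24-character line that follows.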

The “stringJoin” function has type [Char]->[[Char]]->[Char] and simply joins lists of strings using a given delimiter. It is available with the Data.String.Utils module, or you can write one yourself in just a few lines (the standard Data.List.intercalate has the same type and behavior). And that’s it: we’re done! See the complete Haskell program here. Let’s see how the example now looks, after running it through our de-line-wrapping filter and pasting it into the same two text boxes:



Much better!

I think this example demonstrates how Haskell is not just for abstract computer science, but is perfectly useful for messy real-world problems. But much as I like the Haskell solution, what I really wanted was an implementation of the de-carriage-return algorithm in Perl so I could use it in a CGI script (which I am not ready to migrate to Haskell, at least not yet). So coming soon in a follow-up post: automatic carriage return cleansing, in Perl!

note: edited 10/9/11 to move complete code listing from WordPress to github.

note: edited 11/7/11 to spell “pseudocode” correctly (thanks to Ikem)
