A program to intelligently remove carriage returns (so you can paste text without having it look awful)

In Line wrap algorithm on July 3, 2011 by gcbenison

If you just have text you want to clean up… go here.

If you are interested in how the program works… keep reading.

Here is a problem you have probably experienced at some time or another: you want to paste a plain text document somewhere such as an online form, only the window is narrower than the lines in your document and the result looks a bit like an E.E. Cummings poem: the carriage returns are now sprinkled more or less randomly throughout, creating a very ragged-looking right edge. I have often experienced this with those “paste your resume here” fields in online job applications.

To illustrate, I created the following short bio of Drew Endy, a biologist I recently saw speak at my workplace, by stitching together information from his Wikipedia entry and from PubMed. The original is sixty characters wide:

[Figure: plain text pasted into a form that is too wide.]

That looks OK, except it would be better if the text expanded to fill the available horizontal space. Now let’s paste it into a form that is too narrow. The result is a mess:

[Figure: plain text pasted into a form that is too narrow.]

I will confess that more than once I have responded to this problem by manually editing out the carriage returns – hitting backspace, space, down arrow, etc. for every line until it looks right. There has to be a better way. Couldn’t we get the computer to take out the offending carriage returns for us? (At this point the impatient reader may jump to the complete solution or the result; otherwise keep reading for an explanation.)

To get started, let’s first analyze the problem. Clearly the program should not eliminate every carriage return – paragraphs, section headings, and enumerated lists should be left intact. Since the input is a plain text document lacking explicit markup, the program sometimes must guess at the user’s intention. There will be no one correct solution, but we can come close with a few simple rules:

For every line in the document, merge it with the next line unless:

  1. This line is blank.
  2. The following line is blank.
  3. The next line begins with a non-alphanumeric character.
  4. Appending the first word of the next line to this line would result in a line still shorter than the next line.

The perhaps obscure-looking fourth rule is intended to catch ‘subsection-heading’ type lines, such as “Activities:” in the example. Next, let’s translate our pseudocode solution into actual code. Here it is in Haskell:

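import Data.Char (isAlphaNum)

shouldMerge :: String -> String -> Bool
shouldMerge "" _ = False                                             -- rule 1
shouldMerge _ "" = False                                             -- rule 2
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False  -- rule 3
shouldMerge line nextline | length (line ++ " " ++ (head . words) nextline) <
                              length nextline = False                -- rule 4
shouldMerge _ _  = True                                              -- otherwise merge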

Each line of code is pretty much just a statement of one of the rules above – an illustration of how in Haskell, if you can clearly state the problem, you’re just about done writing your program!
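To see the rules in action, here is a small self-contained sketch that applies shouldMerge to a few pairs of lines. The sample pairs and the names samples and results are made up for illustration; they do not come from the example bio.

```haskell
import Data.Char (isAlphaNum)

-- The four rules from the post, repeated so this file loads on its own.
shouldMerge :: String -> String -> Bool
shouldMerge "" _ = False
shouldMerge _ "" = False
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
shouldMerge line nextline
  | length (line ++ " " ++ (head . words) nextline) < length nextline = False
shouldMerge _ _ = True

-- Hypothetical (this line, next line) pairs, one per case.
samples :: [(String, String)]
samples =
  [ ("The quick brown fox jumps", "over the lazy dog and runs")          -- ordinary wrapped prose
  , ("", "over the lazy dog and runs")                                   -- rule 1: this line is blank
  , ("The quick brown fox jumps", "")                                    -- rule 2: next line is blank
  , ("Activities:", "* Reading long novels")                             -- rule 3: next line starts with '*'
  , ("Activities:", "Reading long novels on rainy afternoons")           -- rule 4: short heading-like line
  ]

-- Evaluates to [True,False,False,False,False]
results :: [Bool]
results = map (uncurry shouldMerge) samples
```

Loading this in GHCi and evaluating results shows that only the first pair, ordinary wrapped prose, gets merged.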

To have a complete, working program though, we don’t just want a function that applies to any two lines; we want something that takes any text file as input, applies the rule as many times as needed, and produces a text file as output. Let’s build such a program around a function that takes as input a list of strings (the lines of the input file) and returns a (probably shorter) list of strings in which some of the input lines have been merged:

main = interact (unlines . unWrapLines . lines)

Now how do we implement “unWrapLines” in terms of “shouldMerge”? Clearly we need to iterate over every line in the input, and it would be nice to do it with one of the standard iteration operators. To express it as a right-fold, we need to choose how we will use our accumulator variable, the result that is returned by each iteration and then passed to the next one. A natural choice might be to pass the resulting list of lines as it is built. However, this won’t work because at each iteration we need to compare each line to its successor in the original input, not in the output. So instead of merging as we go during the fold, let’s just group lines that should be merged into lists, and then in a second sweep merge the grouped lines together:

unWrapLines = (map (stringJoin " ")) . innerUnWrap

innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest
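The two passes are easier to follow on a concrete input. The sketch below is self-contained and makes a few assumptions: stringJoin is defined as a stand-in using intercalate from Data.List, cleanText is a hypothetical name for the pure function that main passes to interact, and sampleInput is invented for illustration.

```haskell
import Data.Char (isAlphaNum)
import Data.List (intercalate)

shouldMerge :: String -> String -> Bool
shouldMerge "" _ = False
shouldMerge _ "" = False
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
shouldMerge line nextline
  | length (line ++ " " ++ (head . words) nextline) < length nextline = False
shouldMerge _ _ = True

-- Assumption: a stand-in for stringJoin, built on Data.List.intercalate.
stringJoin :: String -> [String] -> String
stringJoin = intercalate

-- First pass: group each line with the successors it should be merged with.
innerUnWrap :: [String] -> [[String]]
innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest

-- Second pass: join every group into a single line.
unWrapLines :: [String] -> [String]
unWrapLines = map (stringJoin " ") . innerUnWrap

-- Hypothetical name for the String -> String function handed to interact.
cleanText :: String -> String
cleanText = unlines . unWrapLines . lines

-- Invented sample: a wrapped paragraph, a blank line, a heading, an item.
sampleInput :: [String]
sampleInput =
  [ "The quick brown fox jumps"
  , "over the lazy dog and runs"
  , "away."
  , ""
  , "Activities:"
  , "Reading long novels on rainy afternoons"
  ]
```

Here innerUnWrap sampleInput groups the first three lines into one list and leaves the blank line, the heading, and the last line in singleton lists, so unWrapLines sampleInput returns four lines, the first being "The quick brown fox jumps over the lazy dog and runs away.". Because the fold runs from the right, each line is compared with its original successor, which is the head of the group built most recently.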

The “stringJoin” function has type [Char] -> [[Char]] -> [Char] and simply joins lists of strings using a given delimiter. It is available with the Data.String.Utils module, or you can write one yourself in just a few lines (the standard function intercalate from Data.List has the same type and behavior). And that’s it: we’re done! See the complete Haskell program here. Let’s see how the example now looks, after running it through our de-line-wrapping filter and pasting it into the same two text boxes:

Much better!

I think this example demonstrates that Haskell is not just for abstract computer science, but is perfectly useful for messy real-world problems. But much as I like the Haskell solution, what I really wanted was an implementation of the de-carriage-return algorithm in Perl so I could use it in a CGI script (which I am not ready to migrate to Haskell, at least not yet). So coming soon in a follow-up post: automatic carriage return cleansing, in Perl!

note: edited 10/9/11 to move complete code listing from WordPress to github.

note: edited 11/7/11 to spell “pseudocode” correctly (thanks to Ikem)


2 Responses to “A program to intelligently remove carriage returns (so you can paste text without having it look awful)”

  1. > Next, let’s translate our _psuedocode_ solution into actual code.

  2. When I use the web version of your cleanup macro, I find an extra space is inserted where the CR has been removed. Might there be a fix for that? In any case, thank you for this unique solution to a major annoyance.
