Posts Tagged ‘programming’

Articles

The logic bomb that helped me embrace free scientific software

In Free software, science on May 15, 2012 by gcbenison

Scientists often write custom, domain-specific computer programs as part of their research.  But the source code to those programs – when it is released at all – is commonly released under restrictive licenses that prevent redistribution or modification.  Recent editorials in the two most prominent scientific journals arguing for greater sharing of scientific source code have drawn some long-overdue attention to this issue (Reviews of the editorials are freely available online, but both original articles – even while arguing for more openness in science – sit behind massive paywalls, an irony that has been picked up by other observers as well.  Open access is an important issue, but it is a topic for another post.)  The common practice of publishing scientific conclusions based on code that is not itself made public has long troubled me.  One incident in particular brought the issue into sharp focus in my mind.

A defining moment

In the summer of 2003 I spent two months in Puerto Rico to be with my wife, who was doing her dissertation research on shrimp that live in the streams there.  I was a graduate student too at the time, trying to finish my own degree in chemistry.  I knew that during the trip I would have a lot of “down time” holed up in our apartment, which I planned to spend analysing data I had accumulated over several years. Before heading for the tropics, I made sure I had all of my data with me, and the tools I needed to work with it.

One of those tools was a program for translating the raw Nuclear Magnetic Resonance (NMR) data generated by instruments into a more useful representation.  This program, which had been written by scientists working at the National Institutes of Health, was available gratis to researchers and widely used in my field.  I relied on this program, so before leaving for the tropics I made sure that my copy was good.  So, it came as a great surprise several days into my visit when the program abruptly stopped working: it contained a logic bomb. Past a certain date, it would refuse to start normally, instead attempting to send you an email telling you to download the next release.

The program’s authors would claim that the logic bomb was innocuous, that its only purpose was to reduce the burden of supporting multiple versions by keeping all users in sync.  They promised to always provide the next release for free, and they assumed that most users would be in a university setting with a high-speed internet connection, so that the forced upgrade would be a simple matter completed in a couple of minutes.  But in my case their assumptions turned out to be wrong.  The nearest internet connection was a half day’s drive away, slow, and unreliable.  I could not reasonably ask my wife to interrupt her field work schedule so that I could attempt an upgrade, which might not even succeed.  Besides, I had planned in advance to have everything I needed with me, and was annoyed at having my plans derailed by this logic bomb.  And despite having programming skills and a decent development environment available to me, I could not solve my problem simply by editing the program’s source, because I did not have it – the authors declined to distribute that.  That wasn’t some kind of oversight on their part: in a startup message displayed by the program, the authors made clear their intention to prevent any redistribution or modification by users, even to solve their own problems:

“This program and its related software is based  on  a  design  originating in the NIH Laboratory of Chemical Physics, NIDDK.  It is not to be distributed or modified.”

This was a radicalizing, formative moment for me, my equivalent of the encounter between the young Richard Stallman and a malfunctioning laser printer.  I was already a user of Linux and GNU and I had started to learn about the Free Software movement, but it took this incident to show me clearly how attempting to control the behavior of users – even in benign, relatively mild ways like forced upgrades – can have unintended consequences.  And I began to question whether the motivation behind the restrictions on this code really was so benign.  After all, the program’s authors were also my potential competitors.  They said they would always provide future releases freely, but nothing required that they keep this promise.  What if they one day decided to stop providing their upgrades?  What if they only provided them to certain people of their choosing?  The user community that had built up around this program was dependent on its original authors in a way that is unusual in science.  By banning me or anyone else from building on their work, and publishing the results, they were ensuring that any advances would come from them.  It’s normal in science to care deeply about getting proper credit for one’s work and to reserve the right to choose the time and place of publication.  But it is not normal, once you’ve published a method, to try to prevent others from reading about it, modifying it, and publishing their modifications.  Not normal, that is, unless the method happens to be in the form of a computer program, which is just a method of converting input to output.

Broader questions aside, I still was faced with the immediate problem of not being able to do my work.  Not ready to accept the ethical or legal validity of the restrictions that had been placed on me, I set out to solve my own problem however I could.  And as it so often turns out, these programmers’ desire to control their users’ behavior through technical means exceeded their technical ability to do so.

Disabling the logic bomb

I quickly figured out that I could get the program to run by setting the system clock back a month.  But this was inconvenient: it led to incorrect timestamps and all sorts of other problems.  So I set out to circumvent the logic bomb in the program itself.  I was a chemistry student, not a computer science student, but even at that stage in my career I knew enough about programming to surmise that the program might call “time()” somewhere and compare the result to a fixed value.  I fired up the program in a debugger, and saw this:

#0  0xb7ed10f5 in time () from /lib/libc.so.6
#1  0x0809a6c3 in pastDate2 ()
#2  0x0809a5b9 in pastDate ()
#3  0x0804c285 in main ()

OK, not a whole lot of subterfuge there.  What if I just made “pastDate()” return right away?  I did this:

 0809a5a4 <pastDate>:
  809a5a4:      55                      push   %ebp
  809a5a5:      89 e5                   mov    %esp,%ebp
- 809a5a7:      53                      push   %ebx
- 809a5a8:      83 ec 08                sub    $0x8,%esp
- 809a5ab:      68 d6 07 00 00          push   $0x7d6
+ 809a5a7:      c9                      leave  
+ 809a5a8:      c3                      ret    
+ 809a5a9:      ec                      in     (%dx),%al
+ 809a5aa:      08 68 d6                or     %ch,-0x2a(%eax)
+ 809a5ad:      07                      pop    %es
+ 809a5ae:      00 00                   add    %al,(%eax)
  809a5b0:      6a 01                   push   $0x1
  809a5b2:      6a 08                   push   $0x8
  809a5b4:      e8 d3 00 00 00          call   809a68c

And it worked – it took a change of two bytes to disable the logic bomb.  Incidentally, I continued using the program for several more years and never did upgrade.  I never saw the need to.
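For readers curious about the mechanics: the same two-byte change can be applied outside a debugger by overwriting bytes at the right offset in a copy of the binary.  Here is a minimal sketch in Python; the file name, contents, and offset are made up for illustration, and note that a function’s load address (e.g. 0x0809a5a7 in the disassembly above) is not the same as its offset within the file on disk.

```python
# Sketch: patch a copy of a binary by overwriting bytes at a known file
# offset.  Everything here (file name, bytes, offset) is illustrative.

def patch_bytes(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at `offset` in the file at `path`."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)

# A throwaway file standing in for the real binary, containing the same
# prologue as above: push %ebp; mov %esp,%ebp; push %ebx; sub $0x8,%esp
with open("demo.bin", "wb") as f:
    f.write(bytes.fromhex("5589e55383ec08"))

# Mirror the two-byte change shown in the diff: replace the "push %ebx"
# and the first byte of "sub" (53 83) with "leave; ret" (c9 c3).
patch_bytes("demo.bin", 3, bytes.fromhex("c9c3"))

with open("demo.bin", "rb") as f:
    print(f.read().hex())  # 5589e5c9c3ec08
```

Working on a copy, as here, is the sensible precaution: a one-byte slip in the offset produces a binary that crashes rather than one that ignores the date check.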

Why is scientific software so closed?

Every field of science today is dependent on computation, but research being what it is, there are often computational needs that can’t be met by any widely-available, well-known tool.  So scientists who know some programming tend to write these niche, domain-specific applications themselves.  But scientists who know some programming often aren’t very familiar with best practices in software development, and that includes licensing – many scientists, even those who program, have never heard of the GPL.

As a result, such niche scientific programs, when they are released at all, are often released under custom licenses that were written without much thought.  And these custom licenses, more often than not, are quite restrictive.  Why?  There are several reasons.  Some scientists, conscious of their limitations as programmers, don’t want their source code seen out of fear of embarrassment (reputation is everything in science).  Some are concerned about the potential extra work of supporting a user base if their programs are distributed.  Some scientists – or their institution’s intellectual property office – may be holding on to the hope of some day profiting from the code, though this hope is forlorn for most niche, domain-specific code that is only of interest to a handful of colleagues.  Some may actually want to make it harder for others to build on their work, and thus increase their own standing.  That kind of territorial attitude is out of place in science and can actually backfire.

I don’t know if the GPL is the answer to everything in scientific licensing, though I have used it for my own most significant scientific software.  I do know that it should be used more often than it is, and that scientific licensing in general right now tends to be too restrictive.  And I know that the scientific community isn’t thinking about licensing enough.  Even the editorials in Science and Nature are another illustration of this.  They call for distribution of source code that underlies published results.  But access to source code is only part of what software licensing is about.  What about derivative works?  What about warranties?  What about redistribution?

Science depends on the exchange of data, information, and ideas.  In a sense that’s what science is.  More and more, computers are what we use to do this.  So we need to be thinking about standards of behavior for how we use them.  “The code is mine, all mine” won’t cut it.


The Nerdy Stuff Matters

In algorithms on March 28, 2012 by gcbenison

A childhood friend who grew up to be an artist has created many pieces based on letters and words – not just their meanings, but their visual form and their relationships to each other.  Here is one of them:

YES - YET - GET - GOT - GO - NO

The word “yes” changes into the word “no” in five steps, where each step involves either changing, adding, or deleting exactly one letter, with the result of each step being a valid English word.  Upon seeing this I – being a computer programmer – immediately wondered if I could come up with an algorithm to find such sequences in a more general way.  That is, given a starting word and an ending word, could I design an algorithm to transform one word into the other by changing, adding, or deleting exactly one letter at a time, with the result being a valid English word at each step?

After some thought, I did come up with an approach and soon had a working implementation (fair warning: the link is just the source code for a program that does the job; you have to download and compile it to use it.  I’m working on an online version.)  Here is an example of its output:

  • table
  • able
  • ale
  • all
  • hall
  • hail
  • hair
  • chair

Though I had a working program I wanted to see how other programmers would approach the problem, so I posted it as a challenge on Stackoverflow, a site where programmers exchange questions and answers.  As it turns out, this question is a quite common programming challenge (although the rules about adding and deleting a character are most often left out – so that “cat hat hot” would be a valid sequence, but “cat chat that than tan” would not be.  In this form the sequences are called word ladders.)  Variants have appeared as an interview question.  Within minutes of posting, I received answers describing essentially the same approach I had used: “Represent the word relationships as a graph, and use a shortest path algorithm.  Duh.”

Word Ladders as Applied Graph Theory

For those readers not steeped in discrete math and computer science, some definitions are in order.  A graph is a collection of vertices – in this case, the words – and edges – in this case, the transformations relating the words to each other.  Such a graph formed from an English dictionary can be visualized as quite a large web spreading through space; here is a small part of it -

portion of the English language word graph

Once such a graph is defined, the problem of finding the desired sequence of words becomes the problem of finding the shortest possible path between two points in the graph.  This is not an easy problem, but it is a very standard one, so it becomes possible to make use of its very well-known solution.  As a programmer, you need not re-implement the solution or even understand it particularly well; you can just re-use an existing implementation.  I used the one provided by the excellent igraph library.  You simply feed it the graph, the start point, and the end point, and it returns the path.  So the only real work is to build the word graph from all the words in the language.
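The author’s program used the igraph shortest-path routine; to make the idea concrete without any library, here is a sketch in Python using a plain breadth-first search, which finds shortest paths in unweighted graphs.  The tiny edge list stands in for the full dictionary graph.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Shortest path between two vertices of an undirected, unweighted
    graph, given as a list of (word, word) pairs, via breadth-first
    search.  Returns the path as a list of words, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no ladder connects the two words

edges = [("yes", "yet"), ("yet", "get"), ("get", "got"),
         ("got", "go"), ("go", "no")]
print(shortest_path(edges, "yes", "no"))
# ['yes', 'yet', 'get', 'got', 'go', 'no']
```

A dedicated graph library earns its keep on the full 200,000-word graph, but the algorithmic idea is exactly this.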

Building the Word Graph

To build a graph out of all words in an English dictionary, you must provide a list of all pairs of words (cat <-> hat ; hat <-> hot; hot <-> shot) that should be connected.  The graph library doesn’t care what order you provide them in, as long as you provide each pair once.  Clearly the list will be large for the English language – there are over 200,000 words in the spelling dictionary on my computer, and there will be many more connections between these.  So how can you generate a list of all possible pairs?  How can you avoid finding the same pair repeatedly?  How do you know when you’re done?  How can you do this efficiently?

A naive algorithm would compare every word in the dictionary, in order, to every other word, in order.  But this is slow – the number of comparisons grows as the square of the size of the dictionary.  We can do much better by sorting the words into equivalence classes.  Consider words that differ only in their first letter.  We put these into an equivalence class by replacing the first character with a special place-holder.  For example, “walk” and “talk” both become “*alk”; they are in the same equivalence class and therefore should be connected.  Apply a similar procedure to the second character, the third character, etc… to build up a table like this:

walk *alk
walk w*lk
walk wa*k
walk wal*
tall *all
tall t*ll
tall ta*l
tall tal*
ear *ar
ear e*r
ear ea*
talk *alk
talk t*lk
talk ta*k
talk tal*

OK, so how does this speed things up?  It becomes clear when we sort the table by equivalence class:

walk *alk
talk *alk
tall *all
ear *ar
ear e*r
ear ea*
talk t*lk
tall t*ll
talk ta*k
tall ta*l
talk tal*
tall tal*
walk w*lk
walk wa*k
walk wal*

The sort operation has grouped together all words that should be connected: “walk -> talk” (via *alk) and “talk -> tall” (via tal*) in this example.  The slow step in this procedure is the sorting, which can be achieved by any of a number of standard algorithms in time proportional to n * log n – a big improvement over n * n for large dictionaries.
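The sort-and-group procedure above can be sketched in a few lines of Python; this is an illustration of the idea, not the author’s implementation.

```python
from itertools import combinations, groupby

def substitution_pairs(words):
    """Pairs of words that differ by exactly one substituted letter.
    Builds (equivalence class, word) rows, sorts them, then groups
    adjacent rows that share a class -- the O(n log n) scheme above."""
    rows = []
    for w in words:
        for i in range(len(w)):
            rows.append((w[:i] + "*" + w[i+1:], w))
    rows.sort()
    pairs = set()
    for _, group in groupby(rows, key=lambda row: row[0]):
        members = [w for _, w in group]
        for a, b in combinations(members, 2):
            pairs.add((a, b))
    return pairs

print(sorted(substitution_pairs(["walk", "talk", "tall", "ear"])))
# [('talk', 'tall'), ('talk', 'walk')]
```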

So what about the rule that connects words by adding or deleting one character?  This variant of the problem is harder to find online, but I did find an implementation, in a comment on this blog, similar to the one I used.  The key is to create a table, as before, but for each word add all possible rows formed by deleting one letter:

bring ring
bring bing
bring brng
bring brig
bring brin
rings ings
rings rngs
rings rigs
rings rins
rings ring
ear ar
ear er
ear ea

Then sort by the second column, and filter against the sorted dictionary to retain only those rows where the second column is an actual word.  Here is the table after sorting, before the filter:

ear ar
bring bing
bring brig
bring brin
bring brng
ear ea
ear er
rings ings
rings rigs
rings ring
bring ring
rings rins
rings rngs

So in this example, the three rules “rings <-> ring”, “rings <-> rigs”, and “bring <-> ring” were found.  As before, the sorting is the slow step with O(n log n) time complexity.  So how do you combine the connections due to letter substitution with the connections due to deletion?  This is called a graph union and is supported by any graph library.  An interesting effect happens if you include only the connections due to deletion and leave out connections due to substitution.  Now the word must change length by one letter at each step!

  • hands
  • hand
  • and
  • land
  • lad
  • lead
  • led
  • fled
  • fed
  • feed
  • fee
  • feet

Conclusion: the nerdy stuff matters

Every practicing programmer knows that it is possible – perhaps normal is the right word – to write practical, useful code without drawing on the nerdier parts of computer science.  You can write a program that works just fine without ever analyzing it for time complexity or proving its correctness.  The simplest, most intuitive data structures are often best.  But hopefully what the word ladder example illustrates is that often the best way to solve an everyday, practical problem is to translate it into a special case of a well-studied abstract problem, and then apply a standard solution to that problem.  If solving a word ladder does not seem very practical, applied graph theory shows up everywhere: in revision management, in the provenance of lab samples, in deciding the best layout for a communications network.  So keep studying the nerdy stuff – one day, you will find yourself facing a down-to-earth problem that is harder than it looks, and realize that you already have the answer.


A program to intelligently remove carriage returns (so you can paste text without having it look awful)

In Line wrap algorithm on July 3, 2011 by gcbenison

If you just have text you want to clean up… go here.

If you are interested in how the program works… keep reading.

Here is a problem you have probably experienced at some time or another: you want to paste a plain text document somewhere such as an online form, only the window is narrower than the lines in your document and the result looks a bit like an E.E. Cummings poem: the carriage returns are now sprinkled more or less randomly throughout, creating a very ragged-looking right edge. I have often experienced this with those “paste your resume here” fields in online job applications.

To illustrate, I created the following short bio of Drew Endy, a biologist I recently saw speak at my workplace, by stitching together information from his Wikipedia entry and from Pubmed. The original is sixty characters wide:

Plain text, pasted into a form that is too wide.

That looks OK, except it would be better if the text expanded to fill the available horizontal space. Now let’s paste it into a form that is too narrow. The result is a mess:

Plain text, pasted into a form that is too narrow.

I will confess that more than once I have responded to this problem by manually editing out the carriage returns – hitting backspace, space, down arrow, etc. for every line until it looks right. There has to be a better way. Couldn’t we get the computer to take out the offending carriage returns for us? (At this point the impatient reader may jump to the complete solution or the result; otherwise keep reading for an explanation.)

To get started, let’s first analyze the problem. Clearly the program should not eliminate every carriage return – paragraphs, section headings, and enumerated lists should be left intact. Since the input is a plain text document lacking explicit markup, the program sometimes must guess at the user’s intention. There will be no one correct solution, but we can come close with a few simple rules:

For every line in the document, merge it with the next line unless:

  1. This line is blank.
  2. The following line is blank.
  3. The next line begins with a non-alphanumeric character.
  4. Appending the first word of the next line to this line would still leave it shorter than the next line.

The perhaps obscure-looking fourth rule is intended to catch ‘subsection-heading’ type lines, such as “Activities:” in the example. Next, let’s translate our pseudocode solution into actual code. Here it is in Haskell:

import Data.Char (isAlphaNum)

shouldMerge::[Char]->[Char]->Bool
shouldMerge "" _ = False
shouldMerge _ "" = False
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
shouldMerge line nextline | length (line ++ " " ++ (head . words) nextline) <
                              length nextline = False
shouldMerge _ _ = True

Each line of code is pretty much just a statement of one of the rules above – an illustration of how in Haskell, if you can clearly state the problem, you’re just about done writing your program!

To have a complete, working program though, we don’t just want a function that applies to any two lines; we want something that takes any text file as input, applies the rule as many times as needed, and produces a text file as output. Let’s build such a program around a function that takes as input a list of strings (the lines of the input file) and returns a (probably shorter) list of strings in which some of the input lines have been merged:

	main = interact (unlines . unWrapLines . lines)

	unWrapLines::[[Char]]->[[Char]]

Now how do we implement “unWrapLines” in terms of “shouldMerge”? Clearly we need to iterate over every line in the input, and it would be nice to do it with one of the standard iteration operators. To express it as a right-fold, we need to choose how we will use our accumulator variable, the result that is returned by each iteration and then passed to the next one. A natural choice might be to pass the resulting list of lines as it is built. However this won’t work because at each iteration we need to compare each line to its successor in the original input, not in the output. So instead of merging as we go during the fold, let’s just group lines that should be merged into lists, and then in a second sweep merge the grouped lines together:

unWrapLines = (map (stringJoin " ")) . innerUnWrap

innerUnWrap::[[Char]]->[[[Char]]]
innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest

The “stringJoin” function has type [Char]->[[Char]]->[Char] and simply joins lists of strings using a given delimiter. It is available as “join” in the Data.String.Utils module, or you can write one yourself in just a few lines. And that’s it — we’re done!  See the complete Haskell program here. Let’s see how the example now looks, after running it through our de-line-wrapping filter and pasting it into the same two text boxes:



Much better!

I think this example demonstrates how Haskell is not just for abstract computer science, but is perfectly useful for messy real-world problems. But much as I like the Haskell solution, what I really wanted was an implementation of the de-carriage-return algorithm in Perl so I could use it in a CGI script (which I am not ready to migrate to Haskell, at least not yet.) So coming soon in a follow-up post: automatic carriage return cleansing, in Perl!

note: edited 10/9/11 to move complete code listing from WordPress to github.

note: edited 11/7/11 to spell “pseudocode” correctly (thanks to Ikem)
