Archive for March, 2012

Articles

The Nerdy Stuff Matters

In algorithms on March 28, 2012 by gcbenison Tagged: , ,

A childhood friend who grew up to be an artist has created many pieces based on letters and words – not just their meanings, but their visual form and their relationships to each other.  Here is one of them:

YES - YET - GET - GOT - GO - NO

The word “yes” changes into the word “no” in five steps, where each step involves either changing, adding, or deleting exactly one letter, with the result of each step being a valid English word.  Upon seeing this I – being a computer programmer – immediately wondered if I could come up with an algorithm to find such sequences in a more general way.  That is, given a starting word and an ending word, could I design an algorithm to transform one word into the other by changing, adding, or deleting exactly one letter at a time, with the result being a valid English word at each step?

After some thought, I did come up with an approach and soon had a working implementation (fair warning: the link is just the source code for a program that does the job; you have to download and compile it to use it.  I’m working on an online version.)  Here is an example of its output:

  • table
  • able
  • ale
  • all
  • hall
  • hail
  • hair
  • chair

Though I had a working program I wanted to see how other programmers would approach the problem, so I posted it as a challenge on Stackoverflow, a site where programmers exchange questions and answers.  As it turns out, this question is a quite common programming challenge (although the rules about adding and deleting a character are most often left out – so that “cat hat hot” would be a valid sequence, but “cat chat that than tan” would not be.  In this form the sequences are called word ladders.)  Variants have appeared as an interview question.  Within minutes of posting, I received answers describing essentially the same approach I had used: “Represent the word relationships as a graph, and use a shortest path algorithm.  Duh.”

Word Ladders as Applied Graph Theory

For those readers not steeped in discrete math and computer science, some definitions are in order.  A graph is a collection of vertices – in this case, the words – and edges – in this case, the transformations relating the words to each other.  Such a graph formed from an English dictionary can be visualized as quite a large web spreading through space; here is a small part of it -

portion of the English language word graph

Once such a graph is defined, the problem of finding the desired sequence of words becomes the problem of finding the shortest possible path between two points in the graph.  This is not an easy problem, but it is a very standard one, so it becomes possible to make use of its very well-known solution.  As a programmer, you need not re-implement the solution or even understand it particularly well; you can just re-use an existing implementation.  I used the one provided by the excellent igraph library.  You simply feed it the graph, the start point, and the end point, and it returns the path.  So the only real work is to build the word graph from all the words in the language.

Building the Word Graph

To build a graph out of all words in an English dictionary, you must provide a list of all pairs of words (cat <-> hat ; hat <-> hot; hot <-> shot) that should be connected.  The graph library doesn’t care what order you provide them in, as long as you provide each pair once.  Clearly the list will be large for the English language – there are over 200,000 words in the spelling dictionary on my computer, and there will be many more connections between these.  So how can you generate a list of all possible pairs?  How can you avoid finding the same pair repeatedly?  How do you know when you’re done?  How can you do this efficiently?

A naive algorithm would compare every word in the dictionary, in order, to every other word, in order.  But this is slow – the number of comparisons grows as the square of the size of the dictionary.  We can do much better by sorting the words into equivalence classes.  Consider words that differ only in their first letter.  We put these into an equivalence class by replacing the first character with a special place-holder.  For example, “walk” and “talk” both become “*alk”; they are in the same equivalence class and therefore should be connected.  Apply a similar procedure to the second character, the third character, etc… to build up a table like this:

walk *alk
walk w*lk
walk wa*k
walk wal*
tall *all
tall t*ll
tall ta*l
tall tal*
ear *ar
ear e*r
ear ea*
talk *alk
talk t*lk
talk ta*k
talk tal*

OK, so how does this speed things up?  It becomes clear when we sort the table by equivalence class:

walk *alk
talk *alk
tall *all
ear *ar
ear e*r
ear ea*
talk t*lk
tall t*ll
talk ta*k
tall ta*l
talk tal*
tall tal*
walk w*lk
walk wa*k
walk wal*

The sort operation has grouped together all words that should be connected; “walk -> talk” and “talk -> tall” in this example.  The slow step in this procedure is the sorting, which can be achieved by any of a number of standard algorithms in time proportional to n * log n – a big improvement over n * n for large dictionaries.

So what about the rule that connects words by  adding or deleting one character? This variant of the problem is harder to find online, but I did find an implementation in a comment on this blog similar to the one I used.  The key is to create a table, as before, but for each word add all possible rows formed by deleting one letter:

bring ring
bring bing
bring brng
bring brig
bring brin
rings ings
rings rngs
rings rigs
rings rins
rings ring
ear ar
ear er
ear ea

Then sort by the second column, and filter against the sorted dictionary to retain only those rows where the second column is an actual word:

ear ar
bring bing
bring brig
bring brin
bring brng
ear ea
ear er
rings ings
rings rigs
rings ring
bring ring
rings rins
rings rngs

So that in this example, the three rules “rings <-> ring”, “rings <-> rigs”, and “bring <-> ring” were found.  As before, the sorting is the slow step with O(n log n) time complexity.  So how do you combine the connections due to letter substitution with the connections due to deletion?  This is called a graph union and is supported by any graph library.  An interesting effect happens if you include only the connections due to deletion and leave out connections due to substitution.  Now the word must change length by one letter at each step!

  • hands
  • hand
  • and
  • land
  • lad
  • lead
  • led
  • fled
  • fed
  • feed
  • fee
  • feet

Conclusion: the nerdy stuff matters

Every practicing programmer knows that it is possible – perhaps normal is the right word – to write practical, useful code without drawing on the nerdier parts of computer science. You can write a program that works just fine without ever analyzing it for time complexity or proving its correctness.  The simplest, most intuitive data structures are often best.  But hopefully what the word ladder  example illustrates is that often the best way to solve an everyday, practical problem is to translate it into a special case of a well-studied abstract problem, and then apply a standard solution to that problem.  If solving a word ladder does not seem very practical, applied graph theory shows up everywhere: in revision management, in the provenance of lab samples, in deciding the best layout for a communications network.  So keep studying the nerdy stuff – one day, you will find yourself facing a down-to-earth problem that is harder than it looks, and realize that you already have the answer.

Articles

Guest commentary: Thien-Thi Nguyen on guile-www

In Scheme,Web development on March 11, 2012 by gcbenison Tagged: , , , ,

When I needed to develop web applications for the first time, naturally I turned to Scheme, a language which has been central to my evolution as a programmer and which remains my go-to language for many things.  I am still an enthusiast for web development in Scheme, and I think it could become much more popular than it is, particularly if it were easier to find hosting and support.

All the libraries you need for Scheme web development already exist. One of the earliest was guile-www, which provides support for HTTP, cookies, CGI, and related things.  For this post I turned to the current maintainer of guile-www to ask what motivated him to become involved in Scheme web development.  It appears that, perhaps because Scheme remains something of a fringe language on the web, the ecosystem around it continues to evolve rapidly.  Projects fork, merge, and fork again; there isn’t any one “Scheme stack” that dominates.  Nevertheless, the basic reasons for using Scheme for web development – it’s simple, it’s concise, it’s elegant – remain the same.

What drew you to get involved in maintaining guile-www?

ttn: Back in 1998 I made various motorcycle trips, and published a “trip log” with photos to a website.  This used some custom Emacs Lisp to push the html and jpeg files to the server (you might find wup.el — web update — somewhere on the net).  I had just discovered Guile and was intrigued by its module system, certainly a step up from Emacs’ crowded flatland.

I stumbled onto Guile-WWW, which IIRC at that time was client-side and HTTP/1.0 only, and started fiddling with it.  Around 2000 or 2001, I somehow also finangled write privs to Guile proper (the umbrella project that also included the CVS sub-repo of Guile-WWW) and started landing patches there.  Unfortunately, due to a misunderstanding with the former Guile maintainer (Marius Vollmer), those privs got cancelled.

Since the code is GPL, i started maintaining the package privately, naming myself as the copyright holder for new code and making releases every so often.  The hope was (and continues to be, though ever more faintly, seeing how things have developed since) that after some time, my changes could be merged back to GNU Guile.

What advantages do you see Scheme having over other languages for web development?

ttn: Less typing and more fun: My Emacs inserts () instead of [] automatically, edits and motions are by sexps, Emacs boldfaces the current list, etc.

Do you have a favorite application that uses guile-www?

ttn: That would be Sizzweb , i suppose.  I use it to serve Debian ISOs and other stuff on the home network.  Client-side, not so much happening…

The development of guile has picked up in pace in the past few  years; do you see its niche growing and that of guile-www with it?

ttn: Yes.  I believe bundling (web …) modules in Guile proper was a strategic mistake that renders Guile-WWW independence even more important.  So “with it” might be better read as “parallel to it”.

Follow

Get every new post delivered to your Inbox.