
How a changeset is like a vial of bacteria

In biology, Sample-naming algorithm, source code management on December 20, 2011 by gcbenison

What do source code management and keeping track of samples in a biochemistry laboratory have in common? Quite a bit, it turns out.

There are many types of samples in the lab – tubes of purified protein or DNA, liquid bacterial cultures, solid cultures, frozen cell lines, etc. Each sample is derived from one or more other samples (for example, DNA is purified from a cell culture) and can give rise to yet more samples. Because a sample can never be part of its own ancestry, the lineage relationships form a directed acyclic graph (DAG) – the same structure used to describe the revision history of a body of code.

Managing such a collection requires a good way to assign names to new members as they are generated. Most importantly, the names must be unique, and they should also be drawn from some well-defined domain (e.g. “six character alphanumeric string” or “40 hex digit string”). Git famously and brilliantly names each revision with a SHA-1 digest calculated from the revision itself. Having names calculated from content in this way eliminates the need for synchronization in allocating names. My scheme for naming samples in the lab is more traditional in that it does rely on a sort of locking: each investigator has a pre-printed sheet of unique labels for each day, and takes the next label from that sheet every time a new sample is generated. Because there is only one sheet per investigator per day, and each sheet is different, there are no name collisions.
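
As an aside (not in the original post), git-style content-derived naming is easy to sketch in Haskell. This assumes the cryptonite and bytestring packages are available, and contentName is an invented name used only for illustration:

import Crypto.Hash (SHA1 (..), hashWith)
import qualified Data.ByteString.Char8 as BS

-- The "name" of a record is a digest of its content, so names can be
-- allocated independently, with no locking, and still never collide
-- (barring an astronomically unlikely hash collision).
contentName :: String -> String
contentName content = show (hashWith SHA1 (BS.pack content))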

Just as git can reconstruct the revision history of any file by walking through the DAG, I can reconstruct the history of any sample in the lab. Let’s test the efficiency of the process by looking at a typical sample:

A sample tube, displaying unique name

The sample shown above is a mixture of two purified proteins (it is common for labels to combine an ad-hoc descriptive phrase, as a mnemonic, with the unique code that makes the exact history traceable). We begin by looking up the parents of the unique code G280909E found on the sample – an O(1) operation: find the sheet this label was taken from in a sorted binder and read off the parent labels. We then repeat the process for the sample’s parents, and so on, finally ending up with this sample’s lineage:

The two samples farthest back in the ancestry are freezer stocks of bacterial strains, one corresponding to each protein found in the final sample; this is the farthest back this sample can be traced. The intermediate nodes correspond to cell culture and purification steps. Using these labels, it is possible to refer back to any notes or data and unambiguously associate it with this physical sample. So how efficient is this process? Reconstructing the above tree took nine minutes, start to finish (and I should note that this interval includes an interruption caused by a minor lab accident!) Certainly not as fast as reconstructing a file history using git, but still not too bad considering that this is one sample out of hundreds in circulation at any time.
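
Below is a small sketch (mine, not from the original post) of what that walk looks like in code; the names Sample, Binder, and lineage are invented for illustration. Each record carries the labels of its parents, just as each line of a daily manifest does, and the lineage is recovered by following those links recursively, much as git walks its commit graph:

import qualified Data.Map as Map

-- Each sample record carries a description and the labels of its parents.
data Sample = Sample { description :: String, parents :: [String] }

-- The binder of manifests, viewed as a lookup table from label to record.
type Binder = Map.Map String Sample

-- Walk the DAG: a sample's lineage is itself plus the lineage of each parent.
lineage :: Binder -> String -> [(String, String)]
lineage binder label = case Map.lookup label binder of
  Nothing -> []
  Just s  -> (label, description s) : concatMap (lineage binder) (parents s)

In a real collection with shared ancestors one would also deduplicate labels, but the walk itself is the same.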


Contigs, in Haskell.

In biology, haskell on October 27, 2011 by gcbenison

A theme of this blog is that one of the best ways to understand programming algorithms is by analogies with everyday life. Today’s analogy comes from a computational problem that arose from efforts in recent decades to sequence entire genomes. A reminder: a genome consists of one or more strings of DNA bases, where each DNA base or character in the string is chosen from the set (A, T, G, C). A chromosome (or single string of DNA) can be millions of characters long. No currently-existing technology can read a sequence that long; several hundred characters at a time is more typical.

So imagine receiving a message that had been cut up into little fragments:

man who loved the outdoors, and bowling…
an. He was…he was one of us. He was a man w
ne of us. He was a man who loved the outdoors
and a good man. He was…he was one of us. H
e was a man who loved the outdoors, and bowli
was…he was one of us. He was a man who love
loved the outdoors, and bowling…
od man. He was…he was one of us. He was a m

etc.

Reassembling the original would be very difficult if just a single copy of the message had been cut up, because the fragments would be disjoint. But if multiple copies had been cut up, some fragments would share overlapping portions – and with enough overlaps, you could reassemble the original. This is how whole-genome sequences are currently assembled. It leads to an important data structure for modern biology, the contig: a set of strings, each associated with an offset (usually chosen so that the strings align).

Here is a ‘contig’ data type defined in Haskell:

import Data.List

data Contig = Contig [(String, Int)]

indent::Int->String
indent 0 = ""
indent x = " " ++ (indent $ x - 1)

instance Show Contig where
  show (Contig xs) = unlines $ "" : (map indentContig $
                     sortBy (\a b -> compare (snd a) (snd b)) xs)
    where indentContig (str, x) = (indent x) ++ str

*Main> Contig [("lo world", 3),("hello", 0)]

hello
   lo world

Building a contig requires a good choice of alignment algorithms: the fragments may contain mistakes especially at their ends, and there is usually more than one alignment possible. However with this small example even a rudimentary alignment algorithm (see below) starts to reveal the original message:

*Main> assemble fragments

 and a good man. He was...he was one of us. H
             an. He was...he was one of us. He was a man w
                    was...he was one of us. He was a man who love
                                  ne of us. He was a man who loved the outdoors
                                        us. He was a man who loved the outdoors, and
                                             e was a man who loved the outdoors, and bowli
                                                     man who loved the outdoors, and bowling...

The following is a rudimentary, brute-force algorithm for building contigs in Haskell. For genome-length data it is necessary to use much better methods for string alignment, and I will have more on that later…

interleave::[a]->[a]->[a]
interleave (x:xs)(y:ys) = x:y:(interleave xs ys)
interleave xs [] = xs
interleave [] ys = ys

firstThat::(a -> Maybe b) -> [a] -> Maybe b
firstThat op [] = Nothing
firstThat op (x:xs) = proceed candidate
  where candidate = op x
        proceed (Just y) = Just y
        proceed Nothing = firstThat op xs

-- True if either string is a prefix of the other.
isPrefix::String->String->Bool
isPrefix [] _ = True
isPrefix _ [] = True
isPrefix (x:xs)(y:ys) | x == y = isPrefix xs ys
isPrefix _ _ = False

-- True if 'str', placed 'offset' characters from the start of 'target',
-- agrees with 'target' over the region where the two overlap.
matchWithOffset::Int->String->String->Bool
matchWithOffset offset str target
      | offset < (- (length str)) = False
      | offset > (length target) = False
      | offset < 0 = isPrefix (drop (negate offset) str) target
      | otherwise = isPrefix str $ drop offset target

matchWithAnyOffset::String->String->Maybe Int
matchWithAnyOffset str target =
  find (\offset -> matchWithOffset offset str target) $
  interleave [0..(length target - 1)] $ map negate [1..(length str - 1)]
  
matchWithAnyOffsetPlus::String->(String, Int)->Maybe Int
matchWithAnyOffsetPlus str (target, target_offset) = op offset
  where op Nothing = Nothing
        op (Just x) = Just (x + target_offset)
        offset = matchWithAnyOffset str target
          
appendContig::Contig->String->Maybe Contig
appendContig (Contig []) str = Just $ Contig [(str, 0)]
appendContig (Contig xs) str = op $ firstThat (matchWithAnyOffsetPlus str) xs
  where op Nothing = Nothing
        op (Just x) | x > 0 = Just $ Contig ((str, x):xs)
        op (Just x) = Just $ Contig ((str, 0):(map (\(s,i)->(s,i-x)) xs))

assemble::[String]->Maybe Contig
assemble = foldr helper1 (Just $ Contig [])
           where helper1 str (Just c) = helper2 (appendContig c str) c
                 helper2 Nothing c = (Just c)
                 helper2 x _ = x
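
The post doesn’t show the fragments list itself, so here is a minimal way to reproduce a run like the one above, using a few of the fragments from the example (an abridged list; pasting in the full set works the same way):

fragments :: [String]
fragments =
  [ " and a good man. He was...he was one of us. H"
  , "an. He was...he was one of us. He was a man w"
  , "ne of us. He was a man who loved the outdoors"
  , "e was a man who loved the outdoors, and bowli"
  , "man who loved the outdoors, and bowling..."
  ]

main :: IO ()
main = maybe (putStrLn "no contig") print (assemble fragments)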


Line-unwrap: in Perl

In Line wrap algorithm, Perl on October 10, 2011 by gcbenison

As promised, in this post I will translate the line-unwrapping algorithm from Haskell to Perl.  (Why?  Not just as an academic exercise – there are many places where a Perl implementation might be wanted, such as in a CGI program.)

The problem was described here – plain text already containing line breaks, when pasted into a fixed-width container that also breaks lines, ends up with ugly jagged right edges.  What is needed is an algorithm to strip out newlines, but not all newlines: just those mid-paragraph, leaving paragraphs, lists, etc. unchanged.

The Haskell solution consisted of a function to decide whether each line should be merged with the next one, and then a “fold” operation to iterate over the input.  The Perl version of the whether-to-merge function is a pretty straightforward translation of the Haskell version.  Iteration is a different matter: due to the lack of a “fold” function in the Perl language (but see note below), I relied on custom iteration using the input operator “<>”:

while (<>)
{
  my $this_line = $_;
  chomp $this_line;
  if ($sentinel)
  {
    # Decide whether $previous_line keeps its newline or is merged
    # with the line that follows it ($this_line).
    if (($this_line eq "")
        || ($previous_line eq "")
        || ($this_line =~ /^[^A-Za-z0-9]/)
        || early_indent($previous_line, $this_line))
    { $result .= $previous_line . "\n"; }
    else { $result .= ($previous_line . " "); }
  }
  $previous_line = $this_line;
  $sentinel = 1;
}

Overall, the complete Perl solution is just as succinct as the Haskell one (both are 34 lines long) – however, I find the Haskell version more readable.  The use of custom iteration in the form of Perl’s <> operator with destructive updates is less readily discoverable than a standard “fold” operation.  Also, the strangeness with the “$sentinel” variable was necessary to prevent spurious newlines or spaces from showing up before the first line.  In contrast, the Haskell version resembles a straightforward statement of the problem.  And as is often the case, once the Haskell program compiled correctly, it also ran correctly, whereas with the Perl version I had to go through several edit/test/debug cycles to make it stop doing weird things.  Of course this partly reflects my relative skill level in the two languages – though I have known Perl a lot longer than I have known Haskell – but it also reflects the advantages of pure functional programming.

note: Since writing this Perl code, I’ve seen that it’s pretty straightforward to write or to find an implementation of ‘fold’ in Perl.  However, it’s definitely not the bread-and-butter of the language, and it’s worth noting that Perl implementations of ‘fold’ rely internally on destructive updates – whereas in Haskell, it’s not too hard to implement ‘fold’ from scratch in a purely functional way.
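
For comparison, a purely functional right fold really is only a few lines of Haskell. This is just the standard foldr written out, shown here for illustration:

-- Fold a list from the right: no accumulator variable is ever mutated.
myFoldr :: (a -> b -> b) -> b -> [a] -> b
myFoldr _ z []     = z
myFoldr f z (x:xs) = f x (myFoldr f z xs)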


Queries that group tables by contiguous blocks

In SQL on September 26, 2011 by gcbenison

“We’ve entered an era where data is cheap, but making sense of it is not.” (source)

In a world of terabyte hard drives and fast internet connections, we have access to ever-larger data sets of every imaginable type.  The shift away from small, manageable samples toward more global data sets brings with it the promise of understanding the world more fundamentally, with less bias.  Yet big data brings new problems.  As storage capacity grows faster than throughput, new techniques are needed just to access the data efficiently.  More importantly, information overload sets in: paradoxically, the larger a data set, the more urgent it becomes to have some way of making it smaller – and thus understandable.  Put another way, summarizing is becoming more important than ever.

In this post, I will describe a solution to the frequently-encountered problem of summarizing a relational database table as a series of blocks of contiguous rows:


Grouping a table by contiguous rows

This type of grouping comes up often – for example, in an economics blog I follow, where the behavior of some variable over time is used to define time periods.  The idea is best illustrated by a concrete example.  Say we have a table holding predicted rainfall, in inches, for each day of the next several years:

+------------+--------+
| date       | inches |
+------------+--------+
| 2011-08-31 |  11.46 |
| 2011-09-01 |  11.95 |
| 2011-09-02 |  10.84 |
| 2011-09-03 |  11.78 |
| 2011-09-04 |  12.07 |
| 2011-09-05 |  11.52 |
| 2011-09-06 |  12.18 |
| 2011-09-07 |  12.56 |
| 2011-09-08 |  12.20 |
| 2011-09-09 |  11.06 |
| 2011-09-10 |  11.82 |
| 2011-09-11 |  12.76 |
| 2011-09-12 |  12.00 |
| 2011-09-13 |  13.33 |
| 2011-09-14 |  12.64 |
| 2011-09-15 |  12.82 |
| 2011-09-16 |  11.81 |
| 2011-09-17 |  13.07 |

etc.

And we wish to summarize this data as the start and end dates of wet weather periods, where a “wet day” is defined as having more than 10 inches of rainfall.  A simple GROUP BY statement will not do here, because we do not just want a single group of all rows matching our criteria; we want to preserve the order of the rows and find the first and last row of each contiguous block.  So how to do that, given that there is no “look at the next row” or “look at the previous row” function in a SELECT statement?  The key is to join the table to itself, offset by one row:

illustration of a table joined with itself, offset by one row
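
(As an aside, and not part of the original post: the same trick is easy to state in Haskell terms.  Pair each row with the row before it, then keep the rows where the “wet” property switches on; a dummy dry row plays the role of the UNION sentinel used further down.)

-- rows are (date, inches) pairs, already sorted by date.
wetStarts :: [(String, Double)] -> [(String, Double)]
wetStarts rows = [ cur | (prev, cur) <- zip (dummy : rows) rows
                       , not (wet prev), wet cur ]
  where wet (_, inches) = inches > 10
        dummy = ("", 0)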

Now the joined table can be searched for rows where the right-hand row is wet (shown in blue in the illustration) but the left-hand row is not; these rows represent the start of a segment.  Similarly, where a segment ends, the left-hand row is wet and the right-hand one is not.  Performing such a join requires that each row be numbered sequentially, so that we can match on (left.row = right.row - 1).  So we need to modify the query to add a column containing the row number.  There are several ways of doing this, but the method we’ll use here is to build a temporary table with an AUTO_INCREMENT column:
CREATE TEMPORARY TABLE rainfall_ordered (
`Row number` INT PRIMARY KEY AUTO_INCREMENT,
date DATE);
INSERT INTO rainfall_ordered(date)
SELECT date FROM rainfall ORDER BY date;
The resulting temporary table doesn’t copy the entire contents of each row; rather, it just associates each row number with a primary key (the date) from the original table.  One wrinkle: MySQL will not let a TEMPORARY table be referred to twice in the same query, so a second copy built the same way, rainfall_ordered2, is used wherever the table must appear on both sides of a join.  Now we can find the start of each wet season by joining this table with itself on row number, with an offset of one, filtering for rows containing a transition from “not wet” to “wet”:
SELECT * FROM
(SELECT 0 as `Row number`, NULL as date, 0 as inches, 0 as wet UNION
SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE (!(rf1.wet) AND (rf2.wet))

+------------+------------+--------+------+------------+------------+--------+------+
| Row number | date       | inches | wet  | Row number | date       | inches | wet  |
+------------+------------+--------+------+------------+------------+--------+------+
|          0 | NULL       |      0 |    0 |          1 | 2011-08-31 |  11.46 |    1 |
|        213 | 2012-03-30 |   9.72 |    0 |        214 | 2012-03-31 |  11.51 |    1 |
|        219 | 2012-04-05 |   9.34 |    0 |        220 | 2012-04-06 |  10.16 |    1 |
|        224 | 2012-04-10 |    9.3 |    0 |        225 | 2012-04-11 |  10.15 |    1 |
|        226 | 2012-04-12 |   9.36 |    0 |        227 | 2012-04-13 |  10.41 |    1 |
|        321 | 2012-07-16 |   9.59 |    0 |        322 | 2012-07-17 |   10.1 |    1 |
|        325 | 2012-07-20 |   9.99 |    0 |        326 | 2012-07-21 |  10.62 |    1 |
|        328 | 2012-07-23 |   9.26 |    0 |        329 | 2012-07-24 |  10.75 |    1 |
|        337 | 2012-08-01 |   9.72 |    0 |        338 | 2012-08-02 |  11.24 |    1 |
|        342 | 2012-08-06 |   9.95 |    0 |        343 | 2012-08-07 |  10.83 |    1 |
etc.

In each row of this result, a non-wet day appears next to the following day, which is wet, and thus corresponds to the beginning of a wet season.  The UNION construct in the query adds a non-wet row before the start of the table, ensuring that if the very first row is a wet day, it will still be recognized as the start of a period.

In order to obtain a listing of start and end dates, we need to perform a similar query to find the end of wet seasons, and join it to the previous one in the same order.  We do this by again building temp tables and using AUTO_INCREMENT:
/* Save start dates of episodes in a temporary table */
CREATE TEMPORARY TABLE episode_start (`Row number` int primary key auto_increment, date date);
INSERT INTO episode_start(date)
SELECT rf2.date FROM
(SELECT 0 as `Row number`, NULL as date, 0 as inches, 0 as wet UNION
SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE (!(rf1.wet) AND (rf2.wet));
/* Similarly for end dates… */
CREATE TEMPORARY TABLE episode_end (`Row number` int primary key auto_increment, date date);
SET @next_row_number = (SELECT MAX(`Row number`) + 1 FROM rainfall_ordered2);
INSERT INTO episode_end(date)
SELECT rf1.date FROM
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)
UNION SELECT
@next_row_number AS `Row number`,
NULL AS date,
0 AS inches,
0 AS wet) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE ((rf1.wet) AND !(rf2.wet));

/* Join on row number to see results */
SELECT episode_start.date AS `Start date`,
episode_end.date AS `End date`
FROM episode_start
JOIN episode_end USING(`Row number`);

+------------+------------+
| Start date | End date   |
+------------+------------+
| 2011-08-31 | 2012-03-29 |
| 2012-03-31 | 2012-04-03 |
| 2012-04-06 | 2012-04-07 |
| 2012-04-11 | 2012-04-11 |
| 2012-04-13 | 2012-04-14 |
| 2012-07-17 | 2012-07-17 |
| 2012-07-21 | 2012-07-21 |
| 2012-07-24 | 2012-07-27 |
| 2012-08-02 | 2012-08-04 |
| 2012-08-07 | 2013-02-27 |
| 2013-03-03 | 2013-03-03 |
| 2013-03-09 | 2013-03-09 |
| 2013-09-03 | 2013-09-03 |
| 2013-09-06 | 2013-09-06 |
| 2013-09-11 | 2013-09-17 |

etc.

And we’ve arrived at what we were looking for – a listing of the start and end dates of rainy periods, a total of 70 rows derived from the original data set of 3000 records.  (A look at the original data verifies that these dates do indeed correspond to the start and end of stretches of rainy days.)  The pattern emerges that there is one long rainy season lasting from Fall to Spring, with a few rainy days and weeks scattered in between.  We could simplify the result further by filtering out very short rainy stretches:
/* Filter out very short rainy stretches */
SELECT episode_start.date AS `Start date`,
episode_end.date AS `End date`,
DATEDIFF(episode_end.date,episode_start.date) AS length
FROM episode_start
JOIN episode_end USING(`Row number`)
HAVING length > 3;

+------------+------------+--------+
| Start date | End date   | length |
+------------+------------+--------+
| 2011-08-31 | 2012-03-29 |    211 |
| 2012-08-07 | 2013-02-27 |    204 |
| 2013-09-11 | 2013-09-17 |      6 |
| 2013-09-20 | 2014-02-21 |    154 |
| 2014-08-08 | 2014-08-14 |      6 |
| 2014-08-16 | 2015-03-26 |    222 |
| 2015-03-29 | 2015-04-04 |      6 |
| 2015-08-18 | 2016-02-13 |    179 |
| 2016-09-02 | 2016-09-06 |      4 |
| 2016-09-17 | 2017-03-05 |    169 |
| 2017-08-01 | 2018-03-19 |    230 |
| 2018-09-03 | 2019-02-11 |    161 |
| 2019-09-06 | 2019-11-16 |     71 |
+------------+------------+--------+
13 rows in set (0.00 sec)

It seems that the rainy season always comes around the same time of year, yet varies considerably in duration and start date.  And there you have it – a useful observation made possible by a 13-line summary of a 3000-record data set.

edited 18jan12 – found another nice example of this technique used here


API discovery with Scheme

In APIhackday, Scheme, Web development on August 2, 2011 by gcbenison

By now JSON is a widely accepted standard, well supported in the most common web development languages. But the whole point of an open standard is that one ought to be able to work with it in any language and on any platform. Attempting to work with a standard on an unusual platform is a good way to test how “open” it really is, and standards bodies have good reason to pay attention to the results: not only do fringe players sometimes become more important (see, e.g., the Linux kernel), they also help keep standards honest by preventing language-dependent or platform-dependent “conveniences” from sneaking in.

So when my 13-year-old brother-in-law Lucien and I arrived at APIhackday, we decided to work with web APIs in two of our favorite languages – Lua and Scheme – even though these are not often thought of as web programming languages. We chose to work with PDXCouncilConnect, which provides access to civic records of the City of Portland through a web API. A typical JSON result stream looks like this:

{
    "sessions": [
        {
            "header": "", 
            "id": 304, 
            "location": "City Hall - 1221 SW Fourth Avenue", 
            "message": "Due to lack of an agenda there will be no meeting.", 
            "start_date": "2011-06-01 14:00:00.0", 
            "title": ""
        }, 
        {
            "header": "THOSE PRESENT WERE:  Commissioner Saltzman, Presiding; Commissioners Fish \nand Fritz, 3. ", 
            "id": 302, 
            "items": [
                {
                    "bureaus": "", 
                    "disposition": "", 
                    "disposition_url": "", 
                    "emergency": 0, 
                    "files": "", 
                    "heading": "Times Certain", 
                    "id": 921, 
                    "item_number": "", 
                    "item_number_string": "", 
                    "motions": [
                        {
                            "status": "passed", 
                            "title": "Accepted", 
                            "type": "motion", 
                            "votes": [
                                {
                                    "owner": "City Auditor LaVonne Griffin-Valade", 
                                    "vote": "-"
                                }, 
etc...

A simple JSON parser can be written quickly, but it is also possible to find at least a rudimentary one in almost any language. Lucien found a Lua JSON parser and, using the CouncilConnect stream, actually found a bug in it, wrote a workaround, and by the end of the session was parsing the JSON stream. Though there seems to be no dominant JSON parser for Scheme, I quickly found a simple one.

Now that I had the JSON stream as a Scheme object, what to do with it? My first idea was to pull in the guile-gnome library and use it to build a graphical desktop app (using GTK+ widgets) that draws its data from the web stream. By the end of the one-day hackathon it was working, demonstrating the utility of Scheme in pulling together disparate libraries quickly. Though interesting, this result didn’t seem to harness the unique advantages of the Scheme language. After all, JSON comprises nested lists of values, and manipulating lists is what Scheme is for. Couldn’t we use Scheme to transform the data in some useful way? Here is the CouncilConnect stream, represented as a tree:


JSON input, represented as a tree

As I watched people around the room working with this API for the first time, I noticed that I and others were spending much of our time studying the JSON stream to understand this underlying tree structure. Though we often knew immediately which element we were interested in (for example, the “votes” node, highlighted), we first had to learn how to navigate the tree in order to get to it. What if there was a way to bring this part of the API, buried deep in the tree, to the forefront without having to understand the whole API? We can re-root a tree at any node by changing the direction of some of the links (visually, grab the new root and hold it up; let the rest of the tree dangle down):


JSON tree, re-rooted at "votes" element. Edges that have been flipped are highlighted

Now, the most interesting piece of information is easily accessible, right at the top of the tree. It is generally easier – both conceptually and in code – to navigate down a tree than it is to backtrack up it: in the rotated tree, I can still see that the “votes” object is part of a “motions” object which is part of an “items” object which is part of a “session”, and I can still navigate those relationships, but I don’t need to understand them just to get at the “votes” data.

It is possible to write a completely generic Scheme function that takes as arguments any JSON tree and any node-matching predicate, and returns a tree re-rooted at the node matching the predicate (or a list of such trees where more than one node matches). This is the sort of thing Scheme excels at; the heart of the code looks like this:

;; Return an array node,
;; whose members are trees T' which are T re-rooted
;; at nodes N matching (pred N)
(define (reroot T pred)
  (node 'array
	#f
	#f
	(let loop ((T T)
		   (path '()))
	  (if (pred T)
	      (list (rebuild T path 1))
	      (apply append
		     (map-partitions
		      (lambda (child other-children)
			(loop child
			      (cons
			       (node:replace-children T other-children)
			       path)))
		      (node:children T)))))))

;; Rebuild tree rooted at T, but appending links following 'path'
(define (rebuild T path level)
  (if (null? path)
      T
      (node 'object
	    (node:key T)
	    #f
	    (list
	     (node:replace-missing-key T "new_root")
	     (rebuild (node:replace-key
		       (car path)
		       (format #f "xpath-~a-~a"
			       level
			       (or (node:key (car path)) "NA")))
		      (cdr path)
		      (+ level 1))))))

Now, having the re-rooted tree, what should we do with it? How about de-parsing it straight back into JSON? The Scheme program is now acting as a sort of middleware or API virtualization layer – the re-factored JSON stream can be fed to any other JSON-capable tool (in any language). Here is an excerpt of the CouncilConnect JSON stream with “votes” rotated to the top:

[ { "votes": [
            { "owner": "City Auditor LaVonne Griffin-Valade", 
              "vote": "-"  }, 
            {   "owner": "Commissioner Amanda Fritz", 
                "vote": "Yea" }, 
            {   "owner": "Commissioner Nick Fish", 
                "vote": "Yea" }, 
            { "owner": "Mayor Sam Adams", 
                "vote": "-" }
        ], 
        "xpath-1-NA": {
            "xpath-1-NA": {
                "status": "passed", 
                "title": "Accepted", 
                "type": "motion"
            }, 
            "xpath-2-motions": {
                "xpath-2-motions": [], 
                "xpath-3-NA": {
                    "xpath-3-NA": {
                        "bureaus": "", 
                        "emergency": 0, 
                        "heading": "Times Certain", 
                        "id": 921, 
                        "owners": "", 
                        "title": "Portland City Council Agenda", 
                        "topic": "TIME CERTAIN: 9:30 AM - Combined Sewer Overflow Program  (Report introduced by Commissioner Saltzman)  30 minutes requested"
                    }, 

Now, I can easily access the contents of the “vote” object at the head of the stream, and if I care to, I can drill down further to see that the vote was passed and that it was related to a Sewer program.

For another example, let’s hoist the node called “bureaus” to the top and see what happens. In the result below, we find at the head of the JSON stream a piece of business related to the transportation bureau; navigating down a level, we see the result of a vote; down one more level, and we see that this was about disabled parking permits.

[
    { "bureaus": [  "Portland Bureau of Transportation" ], 
        "xpath-1-NA": {
            "xpath-1-NA": {
                "emergency": 0, 
                "files": "", 
                "heading": "Regular Agenda", 
                "id": 922, 
                "item_number": "", 
                "item_number_string": "", 
                "motions": [
                    {
                        "status": "", 
                        "title": "", 
                        "type": "item", 
                        "votes": [
                            {   "owner": "Commissioner Nick Fish", 
                                "vote": "Yea"
                            }, 
                            ...
                        ]
                    }
                ], 
                "owners": [ "Mayor Sam Adams" ], 
                "title": "Portland City Council Agenda", 
                "topic": "Extend the date of the privileges for regular disabled person parking permits (Second Reading Agenda 520; amend Code Section 16.20.640)"
...

So, there it is: an application of Scheme code to web API programming, which leverages the strengths of the language, and which might even be useful. Here is the full code for the JSON-refactoring program:

(Updated 8/7/2011: this code is now also available on github)

;; Greg Benison, 1 August 2011
;; transform JSON input stream (stdin) into a tree;
;; re-root the tree and re-output as JSON (stdout).

(use-modules (json reader)
	     (json writer)
	     (srfi srfi-1)
	     (ice-9 pretty-print))

;; Change this as desired...
(define node-of-interest "votes")

; ---------------- utility functions -----------

(define (->string x)(format #f "~a" x))

(define (compose . ops)
  (if (null? ops)
      (lambda (x) x)
      (lambda (x)
	((car ops)
	 ((apply compose (cdr ops)) x)))))

;; like 'map', but supplies a second
;; argument to 'op' consisting of
;; "the other elements of set" in their
;; original order.
(define (map-partitions op set)
  (let loop ((first-part '())
	     (result '())
	     (rest set))
    (if (null? rest)
	(reverse result)
	(let ((this-element   (car rest))
	      (other-elements (append (reverse first-part)(cdr rest))))
	  (loop (cons this-element first-part)
		(cons (op this-element other-elements) result)
		(cdr rest))))))

(define (json:dump-to-file json fname)
  (let ((outfile (open-output-file fname)))
    (display (json:dump json) outfile)
    (close outfile)))

(define (hash-keys ht)(hash-fold (lambda (k v p)(cons k p)) '() ht))

; ---------------- JSON as tree ----------
(define (node type key value children)
  (cons (cons type key)
	(cons value children)))

(define node:type     caar)
(define node:key      cdar)
(define node:value    cadr)
(define node:children cddr)

(define (node:replace-children N new-children)
  (node (node:type  N)
	(node:key   N)
	(node:value N)
	new-children))

(define (node:replace-key N new-key)
  (node (node:type N)
	new-key
	(node:value N)
	(node:children N)))

(define (node:replace-missing-key N new-key)
  (if (node:key N) N (node:replace-key N new-key)))

(define (node:key? key)
  (lambda (N)(equal? key (node:key N))))

(define (node:take N limit)
  (node:replace-children N (take (node:children N) limit)))

(define (json->tree json key)
  (cond ((hash-table? json)
	 (node 'object
	       key
	       #f
	       (hash-fold
		(lambda (k v p)
		  (cons (json->tree v k) p))
		'()
		json)))
	((list? json)
	 (node 'array
	       key
	       #f
	       (map (lambda (v)(json->tree v #f)) json)))
	(else (node 'scalar key json '()))))

(define (tree->json tree)
  (case (node:type tree)
    ((scalar)(node:value tree))
    ((array)(map tree->json (node:children tree)))
    ((object)
     (let ((table (make-hash-table)))
       (for-each
	(lambda (node)
	  (hash-set! table (node:key node) (tree->json node)))
	(node:children tree))
       table))))

;; Rebuild tree rooted at T, but appending links following 'path'
(define (rebuild T path level)
  (if (null? path)
      T
      (node 'object
	    (node:key T)
	    #f
	    (list
	     (node:replace-missing-key T "new_root")
	     (rebuild (node:replace-key
		       (car path)
		       (format #f "xpath-~a-~a"
			       level
			       (or (node:key (car path)) "NA")))
		      (cdr path)
		      (+ level 1))))))


;; Return an array node,
;; whose members are trees T' which are T re-rooted
;; at nodes N matching (pred N)
(define (reroot T pred)
  (node 'array
	#f
	#f
	(let loop ((T T)
		   (path '()))
	  (if (pred T)
	      (list (rebuild T path 1))
	      (apply append
		     (map-partitions
		      (lambda (child other-children)
			(loop child
			      (cons
			       (node:replace-children T other-children)
			       path)))
		      (node:children T)))))))

; ---------------- input parsing ---------------

(define infile
  (cond ((null? (cdr (command-line)))       (current-input-port))
	((equal? "-" (cadr (command-line))) (current-input-port))
	(else (open-input-file (cadr (command-line))))))

(define input (json:read-value infile))
(define input-as-tree (json->tree input #f))

; ---------------- transform & output ----------

(display
 (json:dump
  (tree->json
   (reroot input-as-tree (node:key? node-of-interest)))))
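
Assuming the program is saved as reroot-json.scm and the (json reader) and (json writer) modules it loads are on Guile's load path, it can be run over a saved API response with something like “guile -s reroot-json.scm council.json”, or with the JSON piped in on standard input; changing node-of-interest at the top of the file selects which node gets hoisted to the root.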


A program to intelligently remove carriage returns (so you can paste text without having it look awful)

In Line wrap algorithm on July 3, 2011 by gcbenison

If you just have text you want to clean up… go here.

If you are interested in how the program works… keep reading.

Here is a problem you have probably experienced at some time or another: you want to paste a plain text document somewhere such as an online form, only the window is narrower than the lines in your document and the result looks a bit like an E.E. Cummings poem: the carriage returns are now sprinkled more or less randomly throughout, creating a very ragged-looking right edge. I have often experienced this with those “paste your resume here” fields in online job applications.

To illustrate, I created the following short bio of Drew Endy, a biologist I recently saw speak at my workplace, by stitching together information from his Wikipedia entry and from Pubmed. The original is sixty characters wide:


Plain text, posted into a form that is too wide.

That looks OK, except it would be better if the text expanded to fill the available horizontal space. Now let’s paste it into a form that is too narrow. The result is a mess:


Plain text pasted into a form that is too narrow.

I will confess that more than once I have responded to this problem by manually editing out the carriage returns – hitting backspace, space, down arrow, etc. for every line until it looks right. There has to be a better way. Couldn’t we get the computer to take out the offending carriage returns for us? (At this point the impatient reader may jump to the complete solution or the result; otherwise keep reading for an explanation.)

To get started, let’s first analyze the problem. Clearly the program should not eliminate every carriage return – paragraphs, section headings, and enumerated lists should be left intact. Since the input is a plain text document lacking explicit markup, the program sometimes must guess at the user’s intention. There will be no one correct solution, but we can come close with a few simple rules:

For every line in the document, merge it with the next line unless:

  1. This line is blank.
  2. The following line is blank.
  3. The next line begins with a non-alphanumeric character.
  4. Appending the first word of the next line to this line would still leave it shorter than the following line.

The perhaps obscure-looking fourth rule is intended to catch ‘subsection-heading’ type lines, such as “Activities:” in the example. Next, let’s translate our pseudocode solution into actual code. Here it is in Haskell:

import Data.Char (isAlphaNum)

shouldMerge::[Char]->[Char]->Bool
shouldMerge "" _ = False
shouldMerge _ "" = False
shouldMerge _ nextline | (not . isAlphaNum . head) nextline = False
shouldMerge line nextline | length (line ++ " " ++ (head . words) nextline) <
                              length nextline = False
shouldMerge _ _  = True

Each line of code is pretty much just a statement of one of the rules above – an illustration of how in Haskell, if you can clearly state the problem, you’re just about done writing your program!

To have a complete, working program though, we don’t just want a function that applies to any two lines; we want something that takes any text file as input, applies the rule as many times as needed, and produces a text file as output. Let’s build such a program around a function that takes as input a list of strings (the lines of the input file) and returns a (probably shorter) list of strings in which some of the input lines have been merged:

	main = interact (unlines . unWrapLines . lines)

	unWrapLines::[[Char]]->[[Char]]

Now how do we implement “unWrapLines” in terms of “shouldMerge”? Clearly we need to iterate over every line in the input, and it would be nice to do it with one of the standard iteration operators. To express it as a right-fold, we need to choose how we will use our accumulator variable, the result that is returned by each iteration and then passed to the next one. A natural choice might be to pass the resulting list of lines as it is built. However this won’t work because at each iteration we need to compare each line to its successor in the original input, not in the output. So instead of merging as we go during the fold, let’s just group lines that should be merged into lists, and then in a second sweep merge the grouped lines together:

unWrapLines = (map (stringJoin " ")) . innerUnWrap

innerUnWrap::[[Char]]->[[[Char]]]
innerUnWrap = foldr process []
  where process line [] = [[line]]
        process line ((x:xs):ys) | shouldMerge line x = (line:(x:xs)):ys
        process line rest = [line]:rest

The “stringJoin” function has type [Char]->[[Char]]->[Char] and simply joins a list of strings using a given delimiter. It is available in the Data.String.Utils module, or you can write one yourself in just a few lines (a short sketch appears below). And that’s it — we’re done! See the complete Haskell program here. Let’s see how the example now looks, after running it through our de-line-wrapping filter and pasting it into the same two text boxes:



Much better!
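
As promised above, here is one way stringJoin might be written from scratch (a sketch; the version in Data.String.Utils works just as well):

-- Join a list of strings, placing the delimiter between each adjacent pair.
stringJoin :: [Char] -> [[Char]] -> [Char]
stringJoin _   []     = ""
stringJoin _   [x]    = x
stringJoin sep (x:xs) = x ++ sep ++ stringJoin sep xs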

I think this example demonstrates how Haskell is not just for abstract computer science, but is perfectly useful for messy real-world problems. But much as I like the Haskell solution, what I really wanted was an implementation of the de-carriage-return algorithm in Perl so I could use it in a CGI script (which I am not ready to migrate to Haskell, at least not yet.) So coming soon in a follow-up post: automatic carriage return cleansing, in Perl!

note: edited 10/9/11 to move complete code listing from WordPress to github.

note: edited 11/7/11 to spell “pseudocode” correctly (thanks to Ikem)


How I name my samples

In Sample-naming algorithm on June 12, 2011 by gcbenison

In programming, as in other things, sometimes half your job comes down to picking good names for things.

Working in a biochemistry lab means constantly generating new samples – sometimes dozens per day.  Some of these samples become the source of multiple data points.  Some samples end up being archived for months or even years.  Projects are interleaved in time.  All of this requires that every researcher adopt a system for attaching names to samples.  There is no choice here – you have a system, even if you didn’t consciously adopt it!

I want my sample names to meet these four criteria:

  • Unique – the name must be obviously distinct from thousands of others, and remain unique over time
  • Terse – long, descriptive names won’t fit on 1 cm tubes
  • Typical – a sample identifier should have a recognizable form, so that it stands out as such in my notes amid all the other writing
  • Easily generated – at the time of sample generation I want to spend zero effort and time on picking the name and ensuring that it meets the other three criteria

I solve the problem by choosing sample identifiers from a namespace composed of: the letter ‘G’ + the date in ‘DDMMYY’ format + one letter [A-Z].  For example, G120611A, G120611B, etc.  This provides 26 unique sample names per day.  I print them out on daily sample manifests, one unique name per line, with space to pencil in a more verbose description:

Daily sample manifest

To assign a name to a sample, I just pick the next empty line on the manifest, and I have confidence that the chosen name is unique and will remain so. Keeping these manifests in a folder provides a succinct running record of all samples I have generated. The names are terse (8 characters) and easily fit on all common lab containers (note that I can and usually do include descriptive information on containers in addition to the unique identifier.)  I include an extra copy of the identifier on each line so I can cut it out and paste it to the physical sample.  Of course it can also be written with a Sharpie:

Eppie tube with unique label

Unique names on Falcon tubes

Having the unique identifier on the physical sample makes it easy to look up information about it in my notebook, and to be sure that the notes refer to that exact sample:

Notebook page

Here is a shell script that will generate sample manifests for the next 30 days:

#!/bin/sh
#
# Generate pages with 26 unique sample names per page, one name per line,
# for the next 30 days.
#
# GCB 12jun2011

n_days=30;

# header section
cat <<EOF
%!PS-Adobe-2.0
/title-font /Helvetica findfont 13 scalefont def
/mini-font  /Helvetica findfont 7  scalefont def
/std-font   /Helvetica findfont 10 scalefont def

/inch {72 mul} def

/page-height 9.5 inch def
/page-width 7.0 inch def
/n-rows 26 def
/v-delta page-height 20 sub n-rows div def

/title {
  1.1 inch dup translate

  0 page-height 0.5 inch sub translate
  0 0 moveto
  gsave
  title-font setfont show
  grestore

  2 inch 0 moveto
  std-font setfont
  (Mellies Lab, Reed College; unique sample labels) show
} def

/label {
  /txt exch def
  0 v-delta -1 mul translate

  0 0 moveto
  std-font setfont
  txt show

  gsave
  0 -5 moveto
  0.6 setlinewidth
  page-width 0 rlineto stroke

  0.3 setgray
  (G666666A) stringwidth pop 10 add -5 moveto
  0 20 rlineto stroke
  grestore

  mini-font setfont
  page-width (G666666A) stringwidth pop 2 mul sub 0 moveto
  txt show
  10 0 rmoveto
  txt show

} def

EOF

for delta in `seq -4 $n_days`
do
  date "+(%a, %b %d %Y) title" --d "+$delta days";
  for idx in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  do
    date "+(G%d%m%y$idx) label" --d "+$delta days";
  done
  echo;
  echo "showpage";
  echo;
done
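
The script writes PostScript to standard output, so (assuming it is saved as, say, make-labels.sh) running something like “sh make-labels.sh > labels.ps” and printing the resulting file produces a month of daily manifests; adjusting n_days at the top changes how far ahead they run.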