Posts Tagged ‘relational database’

Articles

Queries that group tables by contiguous blocks

In SQL on September 26, 2011 by gcbenison Tagged: ,

“We’ve entered an era where data is cheap, but making sense of it is not.” (source)

In a world of terabyte hard drives and fast internet connections, we increasingly have access to larger and larger data sets of every imaginable type.   The shift away from small, manageable samples to more global data sets brings with it the promise of understanding the world more fundamentally, with less bias.  Yet big data brings with it new problems: as storage capacity grows faster than throughput, new techniques are needed just to access the data efficiently; more importantly, information overload sets in: paradoxically, the larger a data set, the more urgent it becomes to have some way of making it smaller and thus understandable.  Put another way, summarizing is becoming more important than ever.

In this post, I will describe a solution to the frequently-encountered problem of summarizing a relational database table as a series of blocks of contiguous rows:

illustration of a table with contiguous blocks of rows

Grouping a table by contiguous rows

This type of grouping occurs often in an economics blog I follow where some variable vs. time is used to define a time period.  The idea is best illustrated by a concrete example.  Say we have a table holding predicted rainfall, in inches, for each day of the next several years:

+------------+--------+
| date       | inches |
+------------+--------+
| 2011-08-31 |  11.46 |
| 2011-09-01 |  11.95 |
| 2011-09-02 |  10.84 |
| 2011-09-03 |  11.78 |
| 2011-09-04 |  12.07 |
| 2011-09-05 |  11.52 |
| 2011-09-06 |  12.18 |
| 2011-09-07 |  12.56 |
| 2011-09-08 |  12.20 |
| 2011-09-09 |  11.06 |
| 2011-09-10 |  11.82 |
| 2011-09-11 |  12.76 |
| 2011-09-12 |  12.00 |
| 2011-09-13 |  13.33 |
| 2011-09-14 |  12.64 |
| 2011-09-15 |  12.82 |
| 2011-09-16 |  11.81 |
| 2011-09-17 |  13.07 |

etc.

And we wish to summarize this data as the start and end dates of wet weather periods where a “wet day” is defined as having more than 10 inches of rainfall.  A simple GROUP BY statement will not do here, because we do not just want a single group of all rows matching our criteria; we want to preserve the order of the rows and find the first and last row of each contiguous block.  So how to do that, given that there is no “look at next row” or “look at last row” function in a SELECT statement?  The key is to join the table to itself, offset by one row:illustration of a table joined with itself, offset by one row

Now, this joined table can be searched for rows where the right column is blue but the left column is not; these rows represent the start of a segment.  Similarly, where segments end, the left column is blue and the right column is not.  Performing such a join requires that each row be numbered sequentially so that we can match on (left.row = right.row – 1).   So we need to modify the query to add a column containing the row number.  There are several ways of doing this, but the method we’ll use here is to build a temporary table with an auto_increment column:
CREATE TEMPORARY TABLE rainfall_ordered (
`Row number` INT PRIMARY KEY AUTO_INCREMENT,
date DATE);
INSERT INTO rainfall_ordered(date)
SELECT date FROM rainfall ORDER BY date;
 The resulting temporary table doesn’t copy the entire contents of each row; rather, it just associates each row number with a primary key from the original table.  Now we can find the start of each wet season by joining this table with itself on row number with an offset of one, filtering for rows containing a transition from “not wet” to “wet”:
SELECT * FROM
(SELECT 0 as `Row number`, NULL as date, 0 as inches, 0 as wet UNION
SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE (!(rf1.wet) AND (rf2.wet))

+------------+------------+--------+------+------------+------------+--------+------+
| Row number | date       | inches | wet  | Row number | date       | inches | wet  |
+------------+------------+--------+------+------------+------------+--------+------+
|          0 | NULL       |      0 |    0 |          1 | 2011-08-31 |  11.46 |    1 |
|        213 | 2012-03-30 |   9.72 |    0 |        214 | 2012-03-31 |  11.51 |    1 |
|        219 | 2012-04-05 |   9.34 |    0 |        220 | 2012-04-06 |  10.16 |    1 |
|        224 | 2012-04-10 |    9.3 |    0 |        225 | 2012-04-11 |  10.15 |    1 |
|        226 | 2012-04-12 |   9.36 |    0 |        227 | 2012-04-13 |  10.41 |    1 |
|        321 | 2012-07-16 |   9.59 |    0 |        322 | 2012-07-17 |   10.1 |    1 |
|        325 | 2012-07-20 |   9.99 |    0 |        326 | 2012-07-21 |  10.62 |    1 |
|        328 | 2012-07-23 |   9.26 |    0 |        329 | 2012-07-24 |  10.75 |    1 |
|        337 | 2012-08-01 |   9.72 |    0 |        338 | 2012-08-02 |  11.24 |    1 |
|        342 | 2012-08-06 |   9.95 |    0 |        343 | 2012-08-07 |  10.83 |    1 |
etc.

In each row of this result, a non-wet day appears next to the following day, which is wet, and thus corresponds to the beginning of a wet season.  The UNION construct in the query adds a non-wet period before the start of the table, ensuring that if the very first row is a wet day, it will be recognized as the start of a period.In order to obtain a listing of start and end dates, we need to perform a similar query to find the end of wet seasons, and join it to the previous one in the same order.  We do this by again building temp tables and using auto_increment:
/* Save start dates of episodes in a temporary table */
CREATE TEMPORARY TABLE episode_start (`Row number` int primary key auto_increment, date date);
INSERT INTO episode_start(date)
SELECT rf2.date FROM
(SELECT 0 as `Row number`, NULL as date, 0 as inches, 0 as wet UNION
SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE (!(rf1.wet) AND (rf2.wet));
/* Similarly for end dates… */
CREATE TEMPORARY TABLE episode_end (`Row number` int primary key auto_increment, date date);
SET @next_row_number = (SELECT MAX(`Row number`) + 1 FROM rainfall_ordered2);
INSERT INTO episode_end(date)
SELECT rf1.date FROM
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)
UNION SELECT
@next_row_number AS `Row Number`,
NULL AS date,
0 AS inches,
0 AS wet) rf2
ON (rf1.`Row number`=(rf2.`Row number` – 1))
WHERE ((rf1.wet) AND !(rf2.wet));/* Join on row number to see results */
SELECT episode_start.date AS `Start date`,
episode_end.date AS `End date`
FROM episode_start
JOIN episode_end USING(`Row number`);

+------------+------------+
| Start date | End date   |
+------------+------------+
| 2011-08-31 | 2012-03-29 |
| 2012-03-31 | 2012-04-03 |
| 2012-04-06 | 2012-04-07 |
| 2012-04-11 | 2012-04-11 |
| 2012-04-13 | 2012-04-14 |
| 2012-07-17 | 2012-07-17 |
| 2012-07-21 | 2012-07-21 |
| 2012-07-24 | 2012-07-27 |
| 2012-08-02 | 2012-08-04 |
| 2012-08-07 | 2013-02-27 |
| 2013-03-03 | 2013-03-03 |
| 2013-03-09 | 2013-03-09 |
| 2013-09-03 | 2013-09-03 |
| 2013-09-06 | 2013-09-06 |
| 2013-09-11 | 2013-09-17 |

etc.

And we’ve arrived at what we were looking for – a listing of the start and end dates of rainy periods, a total of 70 rows derived from the original data set of 3000 records.  (A look at the original data verifies that these dates do indeed correspond to the start and end of stretches of rainy days.)  The pattern emerges that there is one long rainy season lasting from Fall to Spring, with a few rainy days and weeks scattered in between.  We could simplify the result further by filtering out very short rainy stretches:
/* Filter out very short rainy stretches */
SELECT episode_start.date AS `Start date`,
episode_end.date AS `End date`,
DATEDIFF(episode_end.date,episode_start.date) AS length
FROM episode_start
JOIN episode_end USING(`Row number`)
HAVING length > 3;

+------------+------------+--------+
| Start date | End date   | length |
+------------+------------+--------+
| 2011-08-31 | 2012-03-29 |    211 |
| 2012-08-07 | 2013-02-27 |    204 |
| 2013-09-11 | 2013-09-17 |      6 |
| 2013-09-20 | 2014-02-21 |    154 |
| 2014-08-08 | 2014-08-14 |      6 |
| 2014-08-16 | 2015-03-26 |    222 |
| 2015-03-29 | 2015-04-04 |      6 |
| 2015-08-18 | 2016-02-13 |    179 |
| 2016-09-02 | 2016-09-06 |      4 |
| 2016-09-17 | 2017-03-05 |    169 |
| 2017-08-01 | 2018-03-19 |    230 |
| 2018-09-03 | 2019-02-11 |    161 |
| 2019-09-06 | 2019-11-16 |     71 |
+------------+------------+--------+
13 rows in set (0.00 sec)

It seems that the rainy season always comes around the same time of year, yet varies considerably in duration and start date.  And there you have it – a useful observation made possible by a 13-line summary of a 3000-record data set.

edited 18jan12 – found another nice example of this technique used here

Follow

Get every new post delivered to your Inbox.