“We’ve entered an era where data is cheap, but making sense of it is not.” (source)
In a world of terabyte hard drives and fast internet connections, we increasingly have access to larger and larger data sets of every imaginable type. The shift away from small, manageable samples to more global data sets brings with it the promise of understanding the world more fundamentally, with less bias. Yet big data brings with it new problems: as storage capacity grows faster than throughput, new techniques are needed just to access the data efficiently; more importantly, information overload sets in: paradoxically, the larger a data set, the more urgent it becomes to have some way of making it smaller and thus understandable. Put another way, summarizing is becoming more important than ever.
In this post, I will describe a solution to the frequently-encountered problem of summarizing a relational database table as a series of blocks of contiguous rows:
+------------+--------+ | date | inches | +------------+--------+ | 2011-08-31 | 11.46 | | 2011-09-01 | 11.95 | | 2011-09-02 | 10.84 | | 2011-09-03 | 11.78 | | 2011-09-04 | 12.07 | | 2011-09-05 | 11.52 | | 2011-09-06 | 12.18 | | 2011-09-07 | 12.56 | | 2011-09-08 | 12.20 | | 2011-09-09 | 11.06 | | 2011-09-10 | 11.82 | | 2011-09-11 | 12.76 | | 2011-09-12 | 12.00 | | 2011-09-13 | 13.33 | | 2011-09-14 | 12.64 | | 2011-09-15 | 12.82 | | 2011-09-16 | 11.81 | | 2011-09-17 | 13.07 |etc.
CREATE TEMPORARY TABLE rainfall_ordered (
`Row number` INT PRIMARY KEY AUTO_INCREMENT,
date DATE);
INSERT INTO rainfall_ordered(date)
SELECT date FROM rainfall ORDER BY date;
SELECT * FROM
(SELECT 0 as `Row number`, NULL as date, 0 as inches, 0 as wet UNION
SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE (!(rf1.wet) AND (rf2.wet))
+------------+------------+--------+------+------------+------------+--------+------+ | Row number | date | inches | wet | Row number | date | inches | wet | +------------+------------+--------+------+------------+------------+--------+------+ | 0 | NULL | 0 | 0 | 1 | 2011-08-31 | 11.46 | 1 | | 213 | 2012-03-30 | 9.72 | 0 | 214 | 2012-03-31 | 11.51 | 1 | | 219 | 2012-04-05 | 9.34 | 0 | 220 | 2012-04-06 | 10.16 | 1 | | 224 | 2012-04-10 | 9.3 | 0 | 225 | 2012-04-11 | 10.15 | 1 | | 226 | 2012-04-12 | 9.36 | 0 | 227 | 2012-04-13 | 10.41 | 1 | | 321 | 2012-07-16 | 9.59 | 0 | 322 | 2012-07-17 | 10.1 | 1 | | 325 | 2012-07-20 | 9.99 | 0 | 326 | 2012-07-21 | 10.62 | 1 | | 328 | 2012-07-23 | 9.26 | 0 | 329 | 2012-07-24 | 10.75 | 1 | | 337 | 2012-08-01 | 9.72 | 0 | 338 | 2012-08-02 | 11.24 | 1 | | 342 | 2012-08-06 | 9.95 | 0 | 343 | 2012-08-07 | 10.83 | 1 | etc.
/* Save start dates of episodes in a temporary table */
CREATE TEMPORARY TABLE episode_start (`Row number` int primary key auto_increment, date date);
INSERT INTO episode_start(date)
SELECT rf2.date FROM
(SELECT 0 as `Row number`, NULL as date, 0 as inches, 0 as wet UNION
SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)) rf2
ON (rf1.`Row number`=(rf2.`Row number` - 1))
WHERE (!(rf1.wet) AND (rf2.wet));/* Similarly for end dates… */CREATE TEMPORARY TABLE episode_end (`Row number` int primary key auto_increment, date date);
SET @next_row_number = (SELECT MAX(`Row number`) + 1 FROM rainfall_ordered2);
INSERT INTO episode_end(date)
SELECT rf1.date FROM
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered
JOIN rainfall USING(date)) rf1
JOIN
(SELECT `Row number`,
date,
inches,
(inches > 10) AS wet
FROM rainfall_ordered2
JOIN rainfall USING(date)
UNION SELECT
@next_row_number AS `Row Number`,
NULL AS date,
0 AS inches,
0 AS wet) rf2
ON (rf1.`Row number`=(rf2.`Row number` – 1))
WHERE ((rf1.wet) AND !(rf2.wet));
/* Join on row number to see results */
SELECT episode_start.date AS `Start date`,
episode_end.date AS `End date`
FROM episode_start
JOIN episode_end USING(`Row number`);
+------------+------------+ | Start date | End date | +------------+------------+ | 2011-08-31 | 2012-03-29 | | 2012-03-31 | 2012-04-03 | | 2012-04-06 | 2012-04-07 | | 2012-04-11 | 2012-04-11 | | 2012-04-13 | 2012-04-14 | | 2012-07-17 | 2012-07-17 | | 2012-07-21 | 2012-07-21 | | 2012-07-24 | 2012-07-27 | | 2012-08-02 | 2012-08-04 | | 2012-08-07 | 2013-02-27 | | 2013-03-03 | 2013-03-03 | | 2013-03-09 | 2013-03-09 | | 2013-09-03 | 2013-09-03 | | 2013-09-06 | 2013-09-06 | | 2013-09-11 | 2013-09-17 |
etc.
/* Filter out very short rainy stretches */
SELECT episode_start.date AS `Start date`,
episode_end.date AS `End date`,
DATEDIFF(episode_end.date,episode_start.date) AS length
FROM episode_start
JOIN episode_end USING(`Row number`)
HAVING length > 3;
+------------+------------+--------+ | Start date | End date | length | +------------+------------+--------+ | 2011-08-31 | 2012-03-29 | 211 | | 2012-08-07 | 2013-02-27 | 204 | | 2013-09-11 | 2013-09-17 | 6 | | 2013-09-20 | 2014-02-21 | 154 | | 2014-08-08 | 2014-08-14 | 6 | | 2014-08-16 | 2015-03-26 | 222 | | 2015-03-29 | 2015-04-04 | 6 | | 2015-08-18 | 2016-02-13 | 179 | | 2016-09-02 | 2016-09-06 | 4 | | 2016-09-17 | 2017-03-05 | 169 | | 2017-08-01 | 2018-03-19 | 230 | | 2018-09-03 | 2019-02-11 | 161 | | 2019-09-06 | 2019-11-16 | 71 | +------------+------------+--------+ 13 rows in set (0.00 sec)
It seems that the rainy season always comes around the same time of year, yet varies considerably in duration and start date. And there you have it – a useful observation made possible by a 13-line summary of a 3000-record data set.
edited 18jan12 – found another nice example of this technique used here



