Derek Colley

Killer Code - Nested Aggregates using CTEs

(Or, 'what NOT to do on a production SQL Server box')

Yesterday, I was experimenting with pulling out some figures in SQL Server using windowed functions. SQL Server 2012 comes with a couple of really neat system functions called LAG and LEAD, and these are useful for selecting values from n rows behind or in front of a particular row, denoted by column. There's other articles out there that deal with using LAG/LEAD, so I'm not going to cover it here, save to give an example of using this and the requirement I was aiming at.

For reasons of confidentiality I sadly cannot reproduce here the exact code I was using. However here's a simplified example. Here's an initial table configuration:

CREATE TABLE dbo.TestData (
DateCreated DATE,
ArbitraryValue NUMERIC(16,2) )

INSERT INTO dbo.TestData
SELECT '2013-06-01', 342.14 UNION ALL
SELECT '2013-06-01', 659.45 UNION ALL
SELECT '2013-06-01', 283.49 UNION ALL
SELECT '2013-06-01', 903.34 UNION ALL
SELECT '2013-06-01', 129.06 UNION ALL
SELECT '2013-06-01', 756.65 UNION ALL
SELECT '2013-06-01', 239.05 UNION ALL
SELECT '2013-06-01', 194.52 UNION ALL
SELECT '2013-06-01', 804.44 UNION ALL
SELECT '2013-06-01', 116.69

So, what I wanted was to have a third column, produced during a SELECT, which gave me the difference between each ArbitraryValue and the value in the row preceding it. Using LAG, this was quite easy:

SELECT   DateCreated, ArbitraryValue,
ArbitraryValue - LAG(ArbitraryValue, 1, NULL)
OVER ( ORDER BY DateCreated ASC ) [Difference]
FROM dbo.TestData
ORDER BY DateCreated ASC

Next, I decided I wanted a fourth column, to work out the average of the differences - I'm sure there's a mathematical term for this not dissimilar to standard deviation but I'm not a mathematician - and I tried defining a fourth column that averaged the values in all preceding rows using the AVG aggregate, inside a subquery (with DateCreated standing in as a key). The purpose of this column was to, over the DateCreated column, illustrate the regression to the mean of the differences in the ArbitraryValue column, with the latest value being the 'best' average of the differences available:

SELECT t1.DateCreated, t1.ArbitraryValue,
t1.ArbitraryValue - LAG(t1.ArbitraryValue, 1, NULL)
OVER ( ORDER BY t1.DateCreated ASC ) [Difference],
( SELECT AVG(t2.ArbitraryValue)
FROM dbo.TestData t2
WHERE t2.DateCreated < t1.DateCreated
) [Average_Difference]
FROM   dbo.TestData t1
ORDER BY t1.DateCreated

This works wonderfully - for a small result set. When I tested this with absolute values in rows, there was no problem. The execution plan doesn't look bad either - table scans, but then again I haven't defined an index.

So, I modified the query to use aggregates in place of absolute values, as this was the actual requirement. Let's modify the table definition and content as follows which will illustrate what I mean:

TRUNCATE TABLE dbo.TestData
ALTER TABLE dbo.TestData DROP COLUMN ArbitraryValue
INSERT INTO dbo.TestData ( DateCreated )
SELECT '2013-06-01' UNION ALL SELECT '2013-06-01'
UNION ALL SELECT '2013-06-01' UNION ALL
SELECT '2013-06-01' UNION ALL SELECT '2013-06-02'
UNION ALL SELECT '2013-06-02' UNION ALL
SELECT '2013-06-03' UNION ALL SELECT '2013-06-03'
UNION ALL SELECT '2013-06-03' UNION ALL
SELECT '2013-06-04' UNION ALL SELECT '2013-06-05'
UNION ALL SELECT '2013-06-05' UNION ALL
SELECT '2013-06-05' UNION ALL SELECT '2013-06-06'
UNION ALL SELECT '2013-06-06' UNION ALL
SELECT '2013-06-06' UNION ALL SELECT '2013-06-06'
UNION ALL SELECT '2013-06-07' UNION ALL
SELECT '2013-06-07' UNION ALL SELECT '2013-06-08'
UNION ALL SELECT '2013-06-08' UNION ALL
SELECT '2013-06-08' UNION ALL SELECT '2013-06-09'
UNION ALL SELECT '2013-06-09' UNION ALL
SELECT '2013-06-10'

So now I have a table with ten distinct values in it, a total of 25 rows. I now want to modify my SELECT query to give me a count of the number of rows per distinct row (i.e. a row count per distinct day), and calculate the difference, day to day, of these values. In reality, this example had a direct correlation that a row was inserted into the table for an event E. I needed to count the events, grouped by day, the differences between these counts, and the average of these differences. This, unfortunately, didn't work, throwing up an interesting error message:

SELECT t1.DateCreated, COUNT(t1.DateCreated),
COUNT(t1.DateCreated)
- LAG(COUNT(t1.DateCreated), 1, NULL)
OVER ( ORDER BY t1.DateCreated ) [Difference],
( SELECT AVG(COUNT(t2.DateCreated))
FROM dbo.TestData t2
WHERE   t2.DateCreated < t1.DateCreated
) [Average_Difference]
FROM dbo.TestData t1
GROUP BY t1.DateCreated

Msg 130, Level 15, State 1, Line 3
Cannot perform an aggregate function on an expression containing an aggregate or a subquery.

This is due to the SELECT AVG(COUNT(t2.DateCreated)). Fair enough. SQL Server won't allow this as by itself, this isn't a column - it's a statement, and pointless since AVG(COUNT(t2.DateCreated)) == COUNT(t2.DateCreated) for a single value. However, not only have I defined a WHERE clause, I've included it as an extra column using the subquery. While logically this should be fine, it's still a case of 'computer says no'.

I decided to get around this restriction by writing the query as follows, using a CTE for the outer query in place of the subquery, meaning I didn't have to calculate the AVG and COUNT in the same statement:

;WITH t2 (DateCreated, DateCount, Difference) AS (
SELECT t1.DateCreated, COUNT(*),
COUNT(*) - LAG(COUNT(*), 1, NULL)
OVER ( ORDER BY t1.DateCreated ) [Difference]
FROM dbo.TestData t1
GROUP BY t1.DateCreated )
SELECT t3.DateCreated, t3.DateCount, t3.Difference,
( SELECT AVG(t2.Difference)
FROM t2
WHERE t2.DateCreated <= t3.DateCreated
) [Average_Difference]
FROM   t2 t3
ORDER BY t3.DateCreated ASC

For a small result set, this worked beautifully, returning the average difference in the fourth column. The problem came when I tried to scale it up. Result sets for sub-10,000 rows take a few seconds at most. Running this query on a set of 320,000 rows (specified using a WHERE clause which I've omitted from the example above, for clarity) took around 10sec. Scaling it up a bit, I ran this query on 10,000,000 rows (about a year's worth of data) to get the results back in about 1m 30sec. And finally, I attempted to run this for three years of data, estimated about 30m rows.

Oh dear.

Immediate, total devastation. A quick look at Spotlight showed all 16 cores going flat-out at 100% CPU. Alarms started going off. A quick check of IOSphere, the FusionIO software, showed I had managed to achieve what 256 threads of SQLIO at full-tilt could not - complete saturation of FusionIO read-write capability (and bear in mind this can cope with about 24Gbps, around 500k IOPS read, 500k IOPS write - see here for more, http://www.fusionio.com/products/iodrive2-duo/). I stopped the query immediately, and went for the autopsy.

The execution plan showed massive activity with scalar computations with the nested 'call' to t2.Difference causing the first COUNT(*) (in line 3) and the second COUNT(*) (in line 3) to be executed for each and every row returned. This, combined with the simple arithmetic in line 3, combined with the cost of working with the CTE, combined with the subquery AVG calculation, created a massive amount of set-based load that smashed into the CPU schedulers like an iron fist.

I would love to reproduce the execution plans here but shouldn't due to NDA-related reasons. I've got my three years of average differences by using Excel to generate the fourth column instead of SQL. And I'm sure (and if anyone's reading this, they may post) there's fifty different ways of refactoring this query. My next read is going to be Itzik Ben-Gan's 'Microsoft SQL Server 2012 High Performance T-SQL Using Window Functions', so next time I can do it properly.

Lessons learned - don't test on production, no matter how tempting. And don't nest aggregates.

Derek Colley

Thursday, June 13, 2013

No comments:

Post a Comment