Tuesday, February 25, 2014

Getting Around Strict GROUP BY Restrictions in SQL Server

SQL Server adheres to the ANSI-SQL standard, in that column names in queries with GROUP BY clauses (aggregate) queries must be included in the GROUP BY or in the aggregate clause itself.  

However, sometimes you don't want to group by a certain column.  Why would that be so?  Say, for example, you want to query the msdb.dbo.backupset column for information on your last successful backup for each database, and you also want to return the backup size for that backup, too.  Grouping the database name on MAX(backup_date) is fine, but throw in the backup size and you'll end up with a separate row for each database name, distinct backup size and maximum backup date.  I'll show you what I mean.  

Let's try writing a query to return the database name, backup size and the last backup date, grouped by database name.

SELECT   database_name, backup_size, MAX(backup_finish_date) 
FROM   msdb.dbo.backupset 
GROUP BY  database_name, backup_size



All very well ... except we have multiple rows per database.  So, let's try the next logical step - putting a MAX(backup_size) into the mix:

SELECT   database_name, MAX(backup_size), MAX(backup_finish_date) 
FROM   msdb.dbo.backupset 
GROUP BY  database_name



Did this work?  On the face of it, yes.  But let's not be so hasty - let's query the whole dataset on our interesting columns to see if the information presented is, in fact, true:

SELECT database_name, backup_size, backup_finish_date 
FROM msdb.dbo.backupset 


No!  As you can see, the former query with two MAXes returned a row saying that SANDBOX was last backed up on 2014-02-25 20:22:28.000 with a size of 9325877248 bytes.  This is not true.  The last backup date was 2014-02-25 20:22:28.000 with a backup size of 2727645184 bytes, as seen below.  The double MAXes have mangled the result set, displaying incorrect data!  Why?  Because we simply concatenated both MAX values - in the former query, they are independent, and we need them to be coupled.



Why is this query so difficult to write?  This is because T-SQL has a strict adherence to the GROUP BY standards defined in ANSI-SQL unlike some other database engines, such as MySQL.  The way to get around this is to query back on the data in a different way - a self-join, but on a slightly altered GROUP BY syntax.  This will effectively allow us to JOIN one aggregated query on the data (by backup_finish_date) on another aggregated query (by backup_size), as follows:


SELECT b1.database_name, b2.backup_size, b1.last_backup_date
FROM (
SELECT b.database_name, 
MAX(b.backup_finish_date) [last_backup_date]
FROM msdb.dbo.backupset b
GROUP BY b.database_name ) b1 
INNER JOIN 
(
SELECT b.database_name, b.backup_size, 
b.backup_finish_date 
FROM msdb.dbo.backupset b ) b2
ON b1.database_name = b2.database_name 
AND b1.last_backup_date = b2.backup_finish_date 




Why does this work?  Basically, the first sub-query contains the last backup date per database.  By inner joining on both the database name and the last backup date to the ungrouped result set from msdb.dbo.backupset, we can bring the backup size for that database name and backup date (the maximum per database) into the result set, resulting in a listing of database name, backup size and last backup date (with the backup size correct for the backup date).

This is one way of getting around the GROUP BY restrictions in SQL Server - I'd be keen to hear more.  If you have any alternatives, please feel free to leave a comment below.


No comments:

Post a Comment