Monday, 10 January 2011

GROUPING SETS in SQL Server

Something I personally haven't seen much of in the SQL Server world is the use of GROUPING SETS, an operator that can be applied in a GROUP BY clause. So what does it do, and how would you use it?

Take the AdventureWorks sample database as an example playground. Suppose you want to query the sales data to find the following:
1. total sales for each product
2. total sales for each product category
3. total sales

There are a number of ways you could do this. You could of course write 3 separate queries and then handle the 3 resultsets returned. Or, if you wanted a single resultset, you could use one of these approaches:

1. UNION ALL approach
SELECT d.ProductID, NULL AS ProductCategoryId, SUM(d.LineTotal) AS Total -- Total sales by product
FROM SalesLT.SalesOrderDetail d
GROUP BY d.ProductID
UNION ALL 
SELECT NULL, p.ProductCategoryID, SUM(d.LineTotal) -- Total sales by category
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY p.ProductCategoryID
UNION ALL
SELECT NULL, NULL, SUM(d.LineTotal) -- Total of all sales
FROM SalesLT.SalesOrderDetail d

2. WITH CUBE approach
SELECT ProductID, ProductCategoryID, Total
FROM
(
SELECT d.[ProductID], p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY d.ProductID, p.ProductCategoryID
WITH CUBE
) x
WHERE x.ProductID IS NULL OR x.ProductCategoryID IS NULL

3. GROUPING SETS approach
SELECT d.[ProductID], p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY GROUPING SETS((d.ProductID),(p.ProductCategoryID), ())

If you haven't used GROUPING SETS before, this GROUP BY clause may look a bit odd at first. So what's it doing? We're just defining the different things we want to group the results by, as a comma-separated list inside the GROUPING SETS operator. In this scenario, we're saying:
"Return the totals grouped by ProductID, and also the totals grouped by ProductCategoryID, and then also the grand total (indicated by the final pair of empty brackets)."
You can define a grouping involving combinations of fields too. For example:
GROUP BY GROUPING SETS((d.ProductID),(p.ProductCategoryID), 
(d.ProductID, p.ProductCategoryID),())
which is equivalent to the WITH CUBE subquery part of the SQL given in approach 2 above.
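
One practical wrinkle with any grouping set is that a NULL in ProductID or ProductCategoryID can mean "this column was rolled up in this row" rather than a genuine NULL in the data. The following is a minimal sketch, assuming SQL Server 2008 or later, that uses the built-in GROUPING() function to label each row. GROUPING() returns 1 when the column has been aggregated away in that row and 0 otherwise. The RowType column and its label text are made up purely for illustration.

```sql
-- GROUPING(col) = 0 means the row is grouped by col;
-- GROUPING(col) = 1 means col was rolled up (its NULL is a placeholder).
SELECT d.ProductID,
       p.ProductCategoryID,
       SUM(d.LineTotal) AS Total,
       CASE
           WHEN GROUPING(d.ProductID) = 0 THEN 'Product total'
           WHEN GROUPING(p.ProductCategoryID) = 0 THEN 'Category total'
           ELSE 'Grand total'
       END AS RowType -- illustrative label, not part of the original query
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY GROUPING SETS((d.ProductID), (p.ProductCategoryID), ());
```

This makes it easier for client code to split the single resultset back into its three logical parts without relying on NULL checks.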

As always, you should check the performance of the various approaches for your scenario: just because you can write a query one way, with less code, doesn't automatically mean it's going to be the best in terms of performance. Here are the stats for each of the 3 approaches above, based on a cold cache each time.
Approach 1 (UNION ALL): CPU=16, Reads=146, Duration=25
Approach 2 (WITH CUBE): CPU=32, Reads=422, Duration=31
Approach 3 (GROUPING SETS): CPU=31, Reads=422, Duration=28
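
If you want to gather comparable numbers yourself, one way to do it (an assumption on my part, not necessarily how the figures above were captured) is to clear the caches before each run and switch on the IO and time statistics for the session. This is a sketch for a development or test server only, as the DBCC commands affect the whole instance.

```sql
-- Test/dev servers only: these commands flush caches for the whole instance.
CHECKPOINT;             -- write dirty pages to disk so they can be dropped
DBCC DROPCLEANBUFFERS;  -- empty the buffer pool, forcing physical reads
DBCC FREEPROCCACHE;     -- clear cached plans, forcing a fresh compile

SET STATISTICS IO ON;   -- report reads per table in the Messages output
SET STATISTICS TIME ON; -- report CPU and elapsed time in milliseconds

-- The query under test (approach 3 shown here; swap in the others in turn)
SELECT d.ProductID, p.ProductCategoryID, SUM(d.LineTotal) AS Total
FROM SalesLT.SalesOrderDetail d
JOIN SalesLT.Product p ON d.ProductID = p.ProductID
GROUP BY GROUPING SETS((d.ProductID), (p.ProductCategoryID), ());

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Repeat the cache-clearing steps before each approach so every query starts from the same cold state.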

UPDATE:
Thanks to Simon Sabin for the suggestion of trying these out on the full AdventureWorks2008 database to show how they perform on a larger dataset. The AdventureWorksLT2008 database I used above is a simplified, smaller version of the full sample database; for example, SalesOrderDetail contains 542 rows in the "LT" version of the db compared to 121,317 rows in the full version.

So I ran the same set of tests again on AdventureWorks2008 - I had to make a few cosmetic changes to the queries due to the slight differences in table schemas, but the structure of each query remained exactly the same. Here are the results:

Approach 1 (UNION ALL): CPU=312, Reads=3760, Duration=589
Approach 2 (WITH CUBE): CPU=141, Reads=1833, Duration=339
Approach 3 (GROUPING SETS): CPU=141, Reads=1831, Duration=331

Again, the WITH CUBE and GROUPING SETS approaches are all but equal. However, the UNION ALL approach is now the least performant.
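
For reference, here is a rough sketch of what the GROUPING SETS query might look like against the full AdventureWorks2008 schema. The exact changes made for the test aren't listed in this post, so treat the table and column choices as assumptions: in the full database the tables live in the Sales and Production schemas, and Production.Product carries a ProductSubcategoryID rather than a ProductCategoryID.

```sql
-- Assumed adaptation for the full AdventureWorks2008 schema;
-- ProductSubcategoryID stands in for the LT version's ProductCategoryID.
SELECT d.ProductID, p.ProductSubcategoryID, SUM(d.LineTotal) AS Total
FROM Sales.SalesOrderDetail d
JOIN Production.Product p ON d.ProductID = p.ProductID
GROUP BY GROUPING SETS((d.ProductID), (p.ProductSubcategoryID), ());
```

The UNION ALL and WITH CUBE queries can be adapted in the same way by swapping the schema and column names.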

Focusing on the UNION ALL vs the GROUPING SETS approach, when running the two queries in a single batch and displaying the execution plans, the query cost is split as follows: UNION ALL 70% / GROUPING SETS 30%. The UNION ALL approach is hit by having to query data from the SalesOrderDetail table multiple times, so the larger the table, the more expensive this gets. The GROUPING SETS approach, by contrast, uses Table Spool (Eager Spool) operators to fulfil the query. To quote MSDN:
"The Table Spool operator scans the input and places a copy of each row in a hidden spool table that is stored in the tempdb database and existing only for the lifetime of the query. If the operator is rewound (for example, by a Nested Loops operator) but no rebinding is needed, the spooled data is used instead of rescanning the input."

This is a great demonstration of why, in the real world, you should always check how a query will perform against production data volumes.

1 comment:

  1. Interesting stuff! This would have definitely come in handy in one of my former lives as a report writer.
