When browsing the BigQuery docs, I discovered that BigQuery now supports the QUALIFY
-clause. At time of writing, it’s in preview, but you can already try it out.
The QUALIFY
-clause lets you filter on the output of analytic functions. This allows me to scratch an itch I’ve had for a while:
For each distinct value in column X, find the row with max value in column Y
A common pattern is: “For each distinct value in column X, find the row where column Y takes it’s maximum value”. You can accomplish this by using ROW_NUMBER() OVER (PARTITION BY X ORDER BY Y DESC) AS rank
.
An example of this would be to find the date at which each of a set of countries reported the largest number of new confirmed covid-19 cases. We’ll be using the
bigquery-public-data.covid19_open_data
dataset.
Using ROW_NUMBER() OVER (PARTITION BY ...)
SELECT
country_code,
date,
new_confirmed,
ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY new_confirmed DESC) AS rank
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE aggregation_level = 0 -- Country level
AND country_code IN UNNEST(['US', 'GB', 'SE'])
-- Limit nr of rows returned for this example:
AND date BETWEEN '2021-01-01' AND '2021-01-03'
country_code | date | new_confirmed | rank |
---|---|---|---|
SE | 2021-01-03 | 2754 | 1 |
SE | 2021-01-01 | 2618 | 2 |
SE | 2021-01-02 | 2531 | 3 |
US | 2021-01-02 | 280226 | 1 |
US | 2021-01-03 | 207617 | 2 |
US | 2021-01-01 | 172805 | 3 |
GB | 2021-01-02 | 60410 | 1 |
GB | 2021-01-03 | 55163 | 2 |
GB | 2021-01-01 | 31835 | 3 |
Filtering using nested query
The next step is then to filter by rank = 1
. However, since analytic functions are evaluated after both WHERE
and HAVING
, there’s no way to filter on the output of analytic functions. You’ll end up either with a CTE or a nested query:
SELECT * EXCEPT(rank) FROM (
SELECT
country_code,
date,
new_confirmed,
ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY new_confirmed DESC) AS rank
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE aggregation_level = 0
AND country_code IN UNNEST(['US', 'GB', 'SE'])
) WHERE rank = 1
Filtering using QUALIFY
However QUALIFY
can be used to filter the output of analytic functions, which means
we can now finally get rid of the nested query:
SELECT
country_code,
date,
new_confirmed,
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE aggregation_level = 0
AND country_code IN UNNEST(['US', 'GB', 'SE'])
QUALIFY ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY new_confirmed DESC) = 1
country_code | date | new_confirmed |
---|---|---|
GB | 2020-12-29 | 81493 |
SE | 2020-12-23 | 11376 |
US | 2021-01-08 | 312247 |
Itch scratched :)
Performance considerations
As of now, there is no performance difference between filtering using the nested query. In fact, both examples generate the same execution plan.
Other new goodies
I also noticed that operators for pivoting and unpivoting have been added to BigQuery, something I’ve also wanted for a long time. Lak Lakshmanan shows a few examples.