QUALIFY clause is now available in BigQuery


When browsing the BigQuery docs, I discovered that BigQuery now supports the QUALIFY-clause. At time of writing, it’s in preview, but you can already try it out.

The QUALIFY-clause lets you filter on the output of analytic functions. This allows me to scratch an itch I’ve had for a while:

For each distinct value in column X, find the row with max value in column Y

A common pattern is: “For each distinct value in column X, find the row where column Y takes it’s maximum value”. You can accomplish this by using ROW_NUMBER() OVER (PARTITION BY X ORDER BY Y DESC) AS rank.

An example of this would be to find the date at which each of a set of countries reported the largest number of new confirmed covid-19 cases. We’ll be using the bigquery-public-data.covid19_open_data dataset.

Using ROW_NUMBER() OVER (PARTITION BY ...)

SELECT
  country_code,
  date,
  new_confirmed,
  ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY new_confirmed DESC) AS rank
  FROM `bigquery-public-data.covid19_open_data.covid19_open_data` 
WHERE aggregation_level = 0 -- Country level
  AND country_code IN UNNEST(['US', 'GB', 'SE']) 
  -- Limit nr of rows returned for this example:
  AND date BETWEEN '2021-01-01' AND '2021-01-03' 
Table 1: 9 records
country_code date new_confirmed rank
SE 2021-01-03 2754 1
SE 2021-01-01 2618 2
SE 2021-01-02 2531 3
US 2021-01-02 280226 1
US 2021-01-03 207617 2
US 2021-01-01 172805 3
GB 2021-01-02 60410 1
GB 2021-01-03 55163 2
GB 2021-01-01 31835 3

Filtering using nested query

The next step is then to filter by rank = 1. However, since analytic functions are evaluated after both WHERE and HAVING, there’s no way to filter on the output of analytic functions. You’ll end up either with a CTE or a nested query:

SELECT * EXCEPT(rank) FROM (
  SELECT
    country_code,
    date,
    new_confirmed,
    ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY new_confirmed DESC) AS rank
  FROM `bigquery-public-data.covid19_open_data.covid19_open_data` 
    WHERE aggregation_level = 0 
      AND country_code IN UNNEST(['US', 'GB', 'SE'])
) WHERE rank = 1

Filtering using QUALIFY

However QUALIFY can be used to filter the output of analytic functions, which means we can now finally get rid of the nested query:

SELECT
  country_code,
  date,
  new_confirmed,
FROM `bigquery-public-data.covid19_open_data.covid19_open_data` 
WHERE aggregation_level = 0 
  AND country_code IN UNNEST(['US', 'GB', 'SE'])
QUALIFY ROW_NUMBER() OVER (PARTITION BY country_code ORDER BY new_confirmed DESC) = 1
Table 2: 3 records
country_code date new_confirmed
GB 2020-12-29 81493
SE 2020-12-23 11376
US 2021-01-08 312247

Itch scratched :)

Performance considerations

As of now, there is no performance difference between filtering using the nested query. In fact, both examples generate the same execution plan.

Other new goodies

I also noticed that operators for pivoting and unpivoting have been added to BigQuery, something I’ve also wanted for a long time. Lak Lakshmanan shows a few examples.

sql  big-query  gcp  bq