Calculate a reasonable daily usage quota for BigQuery

A previous post showed how to set a ‘daily query usage‘ quota in BigQuery, and what that quota should be to stay within the free tier of BigQuery.

Well, what if your BigQuery query usage exceeds the free tier? How should you decide what is a sensible quota? This post will help you figure out that number.

TL;DR

Copy and paste the following SQL code into the BigQuery editor to have it spit out a recommended quota level.

NB change the following lines with the project and dataset details:

Change line 35 to the appropriate BigQuery region.

/*
A SQL query to calculate an estimate for a resonable
'daily query usage' quota in BigQuery.

Defaults to free tier threshold is predicted value is smaller.

Uses most recent period data to predict future values.

The maximum of the predicted values is then chosen and multiplied 
by a 'buffer_multiplier' to provide some buffer.
*/

-- the number of days to build the history from
DECLARE days_of_history DEFAULT 60;


-- the factor by which the buffer for 
DECLARE buffer_multiplier DEFAULT 2;

-- delete the model once script finishes
DECLARE drop_model BOOLEAN DEFAULT TRUE;


-- create a time series model that is capable of predicting
-- future daily tebibytes used by bigquery
CREATE OR REPLACE MODEL `<PROJECT>.<DATASET>.tebibytesbilledmodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS',
         time_series_timestamp_col='day',
         time_series_data_col='total_tebi_billed') AS
SELECT
  DATE(creation_time) AS day,
  SUM(total_bytes_billed) / POW(1024, 4) AS total_tebi_billed
FROM
  -- **CHANGE THE BELOW LINE TO THE REGION YOUR DATA IS IN** 
  `region-us-central1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  --uses the days of history as the number of days to create the model
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL days_of_history DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
GROUP BY
    day
ORDER BY
    day DESC;



SELECT
  -- get the largest predicted value and scale it up by the 'buffer_multiplier'
  -- returns the free tier quota limit if it is larger.
  ROUND(
    CASE
      WHEN MAX(forecast_value) * buffer_multiplier < 0.03 THEN 0.03
      ELSE MAX(forecast_value) * buffer_multiplier
    END,
    4 
  ) AS recommended_daily_query_usage_quota,
FROM
  ML.FORECAST(MODEL `<PROJECT>.<DATASET>.tebibytesbilledmodel`,
    STRUCT(30 AS horizon, 0.8 AS confidence_level));



-- clean up
IF drop_model THEN
  DROP MODEL`<PROJECT>.<DATASET>.tebibytesbilledmodel`;
END IF;

In detail

The above SQL code creates a time series model based on the last 60 days of data to predict the daily amount of Tebibytes used in queries.

-- create a time series model that is capable of predicting
-- future daily tebibytes used by bigquery
CREATE OR REPLACE MODEL `<PROJECT>.<DATASET>.tebibytesbilledmodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS',
         time_series_timestamp_col='day',
         time_series_data_col='total_tebi_billed') AS
SELECT
  DATE(creation_time) AS day,
  SUM(total_bytes_billed) / POW(1024, 4) AS total_tebi_billed
FROM
  `region-us-central1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  --uses the days of history as the number of days to create the model
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL days_of_history DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
GROUP BY
    day
ORDER BY
    day DESC;

This model is then used to predict daily tebibytes used for 30 days into the future.

The highest predicted daily usage in tebibytes is then selected, and multiplied by the buffer_multiplier to give a value that is twice (default) the largest predicted daily usage figure.

SELECT
  -- get the largest predicted value and scale it up by the 'buffer_multiplier'
  -- returns the free tier quota limit if it is larger.
  ROUND(
    CASE
      WHEN MAX(forecast_value) * buffer_multiplier < 0.03 THEN MAX(forecast_value) * buffer_multiplier
      ELSE MAX(forecast_value) * buffer_multiplier
    END,
    4 
  ) AS recommended_daily_query_usage_quota,
FROM
  ML.FORECAST(MODEL `<PROJECT>.<DATASET>.tebibytesbilledmodel`,
    STRUCT(30 AS horizon, 0.8 AS confidence_level));

The final block deletes the time series model that was created

-- clean up
IF drop_model THEN
  DROP MODEL`<PROJECT>.<DATASET>.tebibytesbilledmodel`;
END IF;

MarquinSmith