Sample data from a BigQuery table

In BigQuery when running a query, the costs associated with doing so include a “scan” of the data that may be processed as part of executing the query.

For example, if you have a 100GB table and you try to run the following query

SELECT
    *
FROM myproject.mydatset.some100gbtable

Then BigQuery will charge you for scanning 100GB of data.

In the weird world that is BigQuery billing, you will still be charged for scanning 100GB of data if you use a limit clause in the SQL statement

SELECT
    *
FROM myproject.mydatset.some100gbtable
LIMIT 100

So what do you do if you just want to see a small subset of rows without being charged for scanning the whole table.

We can use TABLESAMPLE clause to only scan a fraction of a large table

SELECT
    *
FROM myproject.mydatset.some100gbtable
TABLESAMPLE SYSTEM ( 10 PERCENT)

Example

Before TABLESAMPLE

288MB is scanned

After TABLESAMPLE

28MB is scanned (or 10% of total)

Send a Comment

Your email address will not be published.