Minimizing query costs
The HTTP Archive dataset is large and complex, and it’s easy to write queries that are slow and expensive. All BigQuery users have a free quota of 1 TB per month. To stretch your free quota as far as possible, you’ll want to minimize the amount of data that your queries scan. This guide provides some practical tips for minimizing the cost of querying the HTTP Archive dataset.
Use clustered tables
Table | Partitioned by | Clustered by |
---|---|---|
httparchive.crawl.pages | date | client is_root_page rank page |
httparchive.crawl.requests | date | client is_root_page is_main_document type |
For example, the httparchive.crawl.pages
table is partitioned by date
and clustered by the client
, is_root_page
, rank
and page
columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don’t.
Legacy tables like httparchive.pages.2023_05_01_desktop
, however, do not take advantage of these optimizations and always incur the full cost of scanning the entire table.
Use TABLESAMPLE
The TABLESAMPLE
clause allows you to sample a table without scanning the entire table. This is useful for getting a rough idea of the data in a table before running a more expensive query.
For example, without TABLESAMPLE
:
This query will process 6.56 TB when run.
However, the same query with TABLESAMPLE
at 0.01% is much cheaper:
This query will only process 680.01 MB when run.
The 0.01% of rows that are sampled are chosen randomly, so the results of the query will be different each time it’s run.
Use the sample_data
dataset
The sample_data
dataset contains 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.
Table names correspond to their full-size counterparts of the form [table]_1k
for crawl.pages
and crawl.requests
tables. For example, to query the summary data for the subset of 10k pages, you would use the httparchive.sample_data.pages_10k
table.
Use table previews
BigQuery allows you to preview entire rows of a table without incurring a query cost. This is useful for getting a rough idea of the data in a table before running a more expensive query.
To access the preview, click on a table name from the workspace explorer and select the Preview tab.
Note that generating the preview may be slow for these tables as they include large payloads. Also note that the text values are truncated by default, so you will need to expand the field to get the full value.