Skip to content

Migrate queries to `crawl` dataset

New tables have been introduced in the HTTP Archive dataset, which are more efficient and easier to use. The crawl dataset contains all the data from the previous pages, requests, and other datasets. This guide will help you migrate your queries to the new dataset.

Migrating to crawl.pages

Page data schemas comparison

previouslycrawl.pages
date in a table namedate
client as _TABLE_SUFFIXclient
url in pages.YYYY_MM_DD_clientpage
not availableis_root_page
not availableroot_page
not availablerank
$.testID within payload column in pages.YYYY_MM_DD_client, wptid column in summary_pages.YYYY_MM_DD_clientwptid
payload in pages.YYYY_MM_DD_clientpayload
req*, resp* and other in summary_pages.YYYY_MM_DD_clientsummary
$.CUSTOM_METRIC_NAME within payload column in pages.YYYY_MM_DD_clientcustom_metrics
report in lighthouse.YYYY_MM_DD_clientlighthouse
feature, type, id in blink_features.featuresfeature, type, id in features
category, app, info in technologies.YYYY_MM_DD_clientcategories, technology, info in technologies
not availablemetadata

Page query updates

  • Migrate custom metrics
/* This query will process 115 GB when run. */
SELECT
client,
STRING(custom_metrics.performance.lcp_elem_stats.nodeName) AS lcp_elem_node_name,
AVG(INT64(custom_metrics.performance.lcp_elem_stats.size)) AS lcp_elem_node_size
FROM `httparchive.crawl.pages`
WHERE
date = '2024-10-01' AND
is_root_page
GROUP BY
client,
lcp_elem_node_name
ORDER BY
client,
lcp_elem_node_size DESC
  • Migrate summary metrics queries
/* This query will process 34 GB when run. */
SELECT
client,
INT64(summary.numDomains) AS numDomains,
COUNT(0) pages,
AVG(INT64(summary.reqTotal)) AS avg_requests
FROM `httparchive.crawl.pages`
WHERE
date = '2024-10-01' AND
is_root_page
GROUP BY
client,
numDomains
HAVING pages > 1000
ORDER BY numDomains ASC
  • Migrate detected technologies metrics
/* This query will process 7.18 GB when run. */
SELECT
page,
technologies.categories,
technologies.technology,
technologies.info
FROM `httparchive.crawl.pages`,
UNNEST (technologies) AS technologies
WHERE
date = '2024-10-01' AND
client = 'desktop' AND
is_root_page
  • Migrate lighthouse insights
/* This query will process 4.2 TB when run. */
SELECT
page,
lighthouse.audits.`largest-contentful-paint`.numericValue AS LCP
FROM `httparchive.crawl.pages`
WHERE
date = '2024-10-01' AND
client = 'desktop' AND
is_root_page
  • Migrate Blink features metrics
/* This query will process 114 GB when run. */
SELECT
page,
features.feature,
features.type,
features.id
FROM `httparchive.crawl.pages`,
UNNEST (features) AS features
WHERE
date = '2024-10-01' AND
client = 'desktop' AND
is_root_page

Migrating to crawl.requests

Request data schemas comparison

previouslycrawl.requests
date in a table namedate
client as _TABLE_SUFFIXclient
page in requests.YYYY_MM_DD_clientpage
not availableis_root_page
not availableroot_page
url in requests.YYYY_MM_DD_clienturl
firstHtml in summary_requests.YYYY_MM_DD_clientis_main_document
type in summary_requests.YYYY_MM_DD_clienttype
$._index within payload in requests.YYYY_MM_DD_clientindex
payload column in requests.YYYY_MM_DD_clientpayload
req*, resp* and other in summary_requests.YYYY_MM_DD_clientsummary
req_* and reqOtherHeaders in summary_requests.YYYY_MM_DD_clientrequest_headers
resp_* and respOtherHeaders in summary_requests.YYYY_MM_DD_clientresponse_headers
body in response_bodies.YYYY_MM_DD_clientresponse_body

Request query updates

  • Migrate headers metrics
/* This query will process 169 GB when run. */
SELECT
response_header.value AS header_value,
FROM `httparchive.crawl.requests`,
UNNEST(response_headers) AS response_header
WHERE
date = '2024-10-01' AND
client = 'desktop' AND
is_main_document AND
is_root_page AND
LOWER(response_header.name) = 'content-type'
  • Migrate summary metrics
/* This query will process 376 GB when run. */
SELECT
page,
url,
STRING(summary.mimeType) AS mimeType,
INT64(summary.respBodySize) AS respBodySize,
FROM `httparchive.crawl.requests`
WHERE
date = '2024-10-01' AND
client = 'desktop' AND
is_root_page
  • Migrate response body queries
/* This query will process 42.8 TB when run. */
SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.crawl.requests`
WHERE
date = '2024-10-01' AND
client = 'desktop' AND
is_root_page