Requests table
httparchive.all.requests
is a partitioned and clustered table containing one row per request per page tested in the HTTP Archive. Pages are tested on a monthly basis and as of April 2022, both the root page and one secondary page are tested.
Example queries
Here are some common operations you can perform with the requests
table.
Count the pages crawled
/* This query will process 85 GB when run. */
SELECT
client,
is_root_page,
count(0) AS requests_total
FROM `httparchive.all.requests`
WHERE date = "2024-05-01"
group by client, is_root_page
client | is_root_page | requests_total |
---|---|---|
mobile | true | 1517364094 |
desktop | true | 1299394354 |
mobile | false | 1216156430 |
desktop | false | 1093804725 |
Size of requests served
Let’s check the size of individual requests served from websites across the entire dataset. To do this, we’ll be using the respBodySize summary metric. This metric represents the size of the response payload in bytes. Since 1 byte is very granular, we’ll divide by 1024 to get to 1 KB and then by 100 so that we are looking at this data with bin sizes of 100KB. We’ll also wrap this in a CEIL() function to remove the decimal points and then multiply the result by 100. Using this technique, 1234567 bytes would be rounded to a bin of 1300 KB.
/* This query will process 26 GB when run. */
WITH requests AS (
SELECT
CEIL(CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64)/1024/100)*100 AS responseSize100KB,
COUNT(0) OVER () AS total_requests
FROM
`httparchive.all.requests` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE
date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
AND CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) > 0
)
SELECT
responseSize100KB,
COUNT(0) AS requests,
COUNT(0)/ANY_VALUE(total_requests) AS pct_requests
FROM requests
GROUP BY
responseSize100KB
ORDER BY
responseSize100KB ASC
LIMIT 10
responseSize100KB | requests | pct_requests |
---|---|---|
100.0 | 10113115 | 0.90864138408777051 |
200.0 | 486257 | 0.043689133714228209 |
300.0 | 188335 | 0.016921490072264605 |
400.0 | 87127 | 0.0078281714260556891 |
500.0 | 54134 | 0.004863822144433972 |
600.0 | 37443 | 0.0033641721017113315 |
700.0 | 26985 | 0.0024245435505883687 |
800.0 | 19817 | 0.0017805143428575023 |
900.0 | 24519 | 0.0022029788147814046 |
1000.0 | 11787 | 0.0010590363102014118 |
We can see that that 91% of requests have a response size less than 100KB. Try repeating this with 10KB bin sizes and you’ll be able to see the spread of response sizes with more granularity.
Popularity of various image formats
Let’s filter out all of the non-Image content and examine the popularity of various image formats. For example, how often is jpg, gif, webp, etc used.
/* This query will process 8 GB when run. */
WITH requests AS (
SELECT
JSON_VALUE(summary, '$.format') AS format,
page,
COUNT(0) OVER() AS total_requests,
COUNT(DISTINCT page) OVER() AS total_pages
FROM
`httparchive.all.requests` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE
date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
AND type = "image"
)
SELECT
format,
COUNT(0) requests,
COUNT(DISTINCT page) pages,
ROUND(COUNT(0) / ANY_VALUE(total_requests), 2) percent_image_requests,
ROUND(COUNT(DISTINCT page) / ANY_VALUE(total_pages), 2) percent_pages
FROM
requests
GROUP BY
format
ORDER BY
requests DESC
format | requests | pages | percent_image_requests | percent_pages |
---|---|---|---|---|
jpg | 1644804 | 1310081 | 0.38 | 0.43 |
png | 1328825 | 1151809 | 0.31 | 0.38 |
gif | 793541 | 495055 | 0.18 | 0.16 |
svg | 250130 | 227550 | 0.06 | 0.08 |
webp | 223783 | 191184 | 0.05 | 0.06 |
ico | 64468 | 64016 | 0.01 | 0.02 |
avif | 29226 | 25794 | 0.01 | 0.01 |
4405 | 3938 | 0.0 | 0.0 | |
heic | 395 | 382 | 0.0 | 0.0 |
Schema
Field name | Type | Description |
---|---|---|
date | DATE | YYYY-MM-DD format of the HTTP Archive monthly crawl |
client | STRING | Test environment: 'desktop' or 'mobile' |
page | STRING | The URL of the page being tested |
is_root_page | BOOLEAN | Whether the page is the root of the origin |
root_page | STRING | The URL of the root page being tested, the origin followed by / |
url | STRING | The URL of the request |
is_main_document | BOOLEAN | Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects |
type | STRING | Simplified description of the type of resource (script, html, css, text, other, etc) |
index | INTEGER | The sequential 0-based index of the request |
payload | STRING | JSON-encoded WebPageTest result data for this request |
summary | STRING | JSON-encoded summarization of request data |
request_headers | ARRAY<Header> | Request headers |
response_headers | ARRAY<Header> | Response headers |
response_body | STRING | Text-based response body |
date
This field is required for all queries over the requests
table.
YYYY-MM-DD format of the HTTP Archive monthly crawl.
Example: date = '2023-06-01'
client
Test environment: 'desktop'
or 'mobile'
.
page
The URL of the page being tested.
Example: page = 'https://har.fyi/'
is_root_page
Whether the page is the root of the origin.
root_page
The URL of the root page being tested, the origin followed by /
.
Example: root_page = 'https://har.fyi/'
url
The URL of the request
is_main_document
Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects
type
Simplified description of the type of resource (script, html, css, text, other, etc)
index
The sequential 1-based index of the request
payload
JSON-encoded WebPageTest result data for this request
See the payload
reference for more details.
summary
JSON-encoded summarization of request data
See the summary
reference for more details.
request_headers
Request headers
See the Header reference for more details.
response_headers
Response headers
See the Header reference for more details.
response_body
Text-based response body