Guided Tour

The HTTP Archive contains a tremendous amount of information that can be used to understand the evolution of the web. And since the raw data is available in Google BigQuery, you can start digging into it with a minimal amount of setup!

If you are new to BigQuery, the Getting Started guide will walk you through the basic setup. That guide ends with a sample query that explores MIME types from the pages tables. In this guide, we’ll explore more of the tables and build additional queries that you can learn from. The easiest way to get started is to follow along, run some of the queries, and learn from the results. If you need help, plenty of support is available from the community at https://discuss.httparchive.org.

Prerequisites:

  • This guide assumes that you’ve completed the setup from the Getting Started guide.
  • To process the extremely large tables in this dataset safely, follow the minimizing query costs guide.
  • This guide also assumes some familiarity with SQL. All of the examples use Standard SQL.

Migration Guides:

This guide is split into multiple sections, each one focusing on different tables in the HTTP Archive. Each section builds on top of the previous one:

  1. Exploring the httparchive.crawl.pages tables
  2. Exploring the httparchive.crawl.requests tables
  3. JOINing pages and requests tables