GET_HOST_CATEGORIES function

We are happy to release the largest open-source dataset of website categories, covering 147 million hosts and 31 million domains.

The GET_HOST_CATEGORIES function retrieves the category information of a given hostname. It maps hostnames to their respective categories using a pre-classified dataset.

Input

input_host

The function takes a single parameter: input_host, which is a STRING representing the hostname to be classified. The hostname can be a full URL (e.g., https://store.google.com/product) or just the host (e.g., store.google.com).

Type: STRING
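
As a quick check of the two accepted input forms, the sketch below passes the full URL and the bare host from the description above through the function in one query. The aliases input_host and category are only illustrative names chosen here; the results are not shown because they depend on the current dataset.

```sql
-- Run the function on a full URL and on the bare host, side by side.
SELECT
  input_host,
  category.category_id,
  category.full_category
FROM UNNEST([
  'https://store.google.com/product',  -- full URL form
  'store.google.com'                   -- host-only form
]) AS input_host,
  UNNEST(`httparchive.fn.GET_HOST_CATEGORIES`(input_host)) AS category
ORDER BY input_host, category.category_id;
```

If both forms are handled as described, the two inputs should list the same categories.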

Output

  • ARRAY of STRUCT containing:
    • domain (STRING): The queried host.
    • category_id (INT64): The category ID of the host.
    • full_category (STRING): The full hierarchical category.
    • subcategory (STRING): The most specific subcategory.
    • parent_category (STRING): The parent category.

Example Query

SELECT *
FROM UNNEST(httparchive.fn.GET_HOST_CATEGORIES('apple.com'))

You can also integrate this function into another SQL query by joining its result with your target table. Here is an example:

SELECT
  url,
  category_id,
  full_category,
  subcategory,
  parent_category
FROM `httparchive.urls.latest_crux_mobile`,
  UNNEST(`httparchive.fn.GET_HOST_CATEGORIES`(url))
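
One thing to keep in mind with the join above: in BigQuery, a comma join against UNNEST produces no row when the array is empty or NULL. The variant below is a sketch that keeps every URL by using LEFT JOIN instead. It assumes the function returns an empty or NULL array for hosts it cannot classify, which is not confirmed here; the alias category is an illustrative name.

```sql
-- Keep URLs with no category: their category fields come back as NULL.
SELECT
  url,
  category.category_id,
  category.full_category,
  category.subcategory,
  category.parent_category
FROM `httparchive.urls.latest_crux_mobile`
LEFT JOIN UNNEST(`httparchive.fn.GET_HOST_CATEGORIES`(url)) AS category
```

Rows where category_id is NULL would then correspond to URLs without a classification.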

Methodology

Hostnames were classified by Nurullah Demir and Yohan Beugin with the model behind Chrome’s Topics API (currently chrome5), using this repository. The taxonomy is version 2, which consists of 469 categories as defined in the Topics API taxonomy v2.

This model is applied to all requests in the HTTP Archive’s dataset from the first crawl in November 2010 to June 2024. These requests include the page’s HTML document itself as well as all of its subresources.

Raw Data

The raw data for the classifications is stored in the httparchive.urls.categories table. This table consists of pre-classified hostnames with their corresponding categories. The categories follow a hierarchical structure, providing both specific subcategories and broader parent categories.

Our method has limitations for some hosts, discussed in the Limitations section. While the raw data can be accessed directly, we highly recommend using the GET_HOST_CATEGORIES function because it handles hashed subdomains. If your analysis works at the domain level (e.g., google.com instead of maps.google.com), accessing the raw data directly is also appropriate.
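
The column names of the raw table are not listed on this page, so a reasonable first step is to preview a few rows before writing filters or joins against it. A minimal sketch that makes no assumptions about the schema:

```sql
-- Preview the raw classification table to inspect its columns.
SELECT *
FROM `httparchive.urls.categories`
LIMIT 10;
```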

Limitations

As with many classification models, our approach has some limitations:

  • Unclassified Hosts: Some hosts might not be classified if they fall outside the scope of the classification model used, such as adult or gambling sites.
  • Hashed Subdomains: For certain hosts like googlesyndication.com and others, the function returns the category of the main domain. This is because these hosts typically contain hashed subdomains, which would otherwise lead to inconsistent classifications.

The function applies this fallback to a fixed list of domains (currently the top 50) known for having numerous hashed subdomains. Some of these domains include:

  • googlesyndication.com
  • gstatic.com
  • cloudfront.net
  • akamaihd.net
  • doubleclick.net
  • amazonaws.com

And others as seen in the function source.
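
To see the hashed-subdomain handling in practice, the sketch below queries one of the listed domains together with a made-up subdomain of it. The subdomain a1b2c3d4.googlesyndication.com is hypothetical and only stands in for a hashed host; the aliases host and category are illustrative names.

```sql
-- Compare a listed main domain with a hypothetical hashed subdomain of it.
SELECT
  host,
  category.category_id,
  category.full_category
FROM UNNEST([
  'googlesyndication.com',
  'a1b2c3d4.googlesyndication.com'  -- hypothetical hashed subdomain
]) AS host
LEFT JOIN UNNEST(`httparchive.fn.GET_HOST_CATEGORIES`(host)) AS category
ORDER BY host, category.category_id;
```

If the fallback works as described above, both hosts should list the categories of the main domain.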