BigQuery Querying

Archived data from SparkLogs is stored in a Hive-partitioned Parquet data lake (details), and can be queried natively through BigQuery external tables once it has been replicated to a Google Cloud Storage bucket.

Before you begin:

  • Make sure you have a BigQuery dataset created to hold the external tables.
  • Make sure your data is being replicated to a Google Cloud Storage bucket and you have at least one day's worth of data in the bucket.
    • To avoid data transfer fees and get the best performance, the bucket should be in the same region as your BigQuery dataset.
    • If you are using the US multi-region for BigQuery, the bucket should be in the us-central region. (details)
    • If you are using the EU multi-region for BigQuery, the bucket should be in the europe-west4 region.
  • Make sure users who will query the data have the necessary BigQuery permissions to create external tables and query data, as well as permissions to list and read objects in the object storage bucket. (For advanced setups, BigQuery BigLake can be used for finer-grained access control if desired.)
  • Set the appropriate storage tier for objects in the bucket based on how often you plan to query the data.

This guide will cover the process for setup, schema discovery, automatic partition discovery, and querying.

Set Up BigQuery External Tables

BigQuery external tables can be created that point to the Hive-partitioned Parquet data lake stored in your Google Cloud Storage bucket.

For each table in the archive that you want to query, create an external table in BigQuery. For example, if your bucket is my-bucket, the archive data was replicated to the root of the bucket, and you want to create an external table in the mydataset dataset for the logs table, you would run the following SQL in BigQuery:

CREATE OR REPLACE EXTERNAL TABLE mydataset.logs
WITH PARTITION COLUMNS
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/archive/logs/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/archive/logs',
  require_hive_partition_filter = true
);

The schema of the table will be automatically discovered from the metadata in the Parquet files.

Query Data

Your external tables should now be defined and ready to query in BigQuery. Use BigQuery Studio or the bq command-line tool to discover available tables, view their schemas, and run SQL queries.
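
For example, the INFORMATION_SCHEMA views can be used to list the tables in a dataset and to inspect a table's columns, including the automatically discovered partition columns. A minimal sketch, assuming the mydataset dataset and logs table from the setup above:

-- List the tables defined in the dataset.
SELECT table_name, table_type
FROM mydataset.INFORMATION_SCHEMA.TABLES;

-- Inspect the discovered schema of the logs table, including the Hive partition columns.
SELECT column_name, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'logs';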

An example query:

SELECT * FROM mydataset.logs
WHERE LOWER(source) LIKE LOWER('%api-worker%')
AND LOWER(message) LIKE LOWER('%/request/make-payment%')
AND part_org_id='all' AND archive_date BETWEEN '2026-01-01' AND '2026-02-04';

When querying, always filter on the archive_date partition key to minimize scanned data and reduce costs.

All of the non-standard fields are stored in a JSON field. For example, for the logs table the JSON field is event_json. Use BigQuery's JSON functions with this data, such as JSON_QUERY and the LAX_* conversion functions (which convert a JSON value to the corresponding SQL type, or return NULL if the value cannot be converted).

For example:

SELECT * FROM mydataset.logs
WHERE trace_id = 'bf674a2d-b3a7-4936-8db8-f675dde9ab70'
AND LAX_STRING(JSON_QUERY(event_json, '$.span_id')) = 'ae2a6389-56c8-446d-b596-fb3e3f262149'
AND part_org_id='all' AND archive_date = DATE '2026-02-04';

Note that any timestamps stored in JSON fields are stored as microseconds since the UNIX epoch, and need to be converted back to TIMESTAMP values using the TIMESTAMP_MICROS function. For example:

SELECT * FROM mydataset.logs
WHERE TIMESTAMP_MICROS(LAX_INT64(JSON_QUERY(event_json, '$.first_provisioned_date'))) < TIMESTAMP('2025-06-30 00:00:00')
AND part_org_id='all' AND archive_date = DATE '2026-02-04';
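
If you are comparing against a fixed cutoff, an alternative (sketch only, reusing the first_provisioned_date field from the example above) is to convert the TIMESTAMP literal to microseconds once with UNIX_MICROS, rather than converting the JSON value on every row:

SELECT * FROM mydataset.logs
WHERE LAX_INT64(JSON_QUERY(event_json, '$.first_provisioned_date')) < UNIX_MICROS(TIMESTAMP('2025-06-30 00:00:00'))
AND part_org_id='all' AND archive_date = DATE '2026-02-04';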

Advanced: Loading Data Into Native BigQuery Tables

For improved performance, you can also load the data into native BigQuery tables and incrementally append new data on a daily basis, then query the native tables instead of the external tables.

Native BigQuery tables can take advantage of clustering and partitioning for improved query performance, and you can also then create materialized views and use BI Engine for even faster performance (e.g., for dashboards in Looker Studio).
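
For example, once the native logs table described below exists, you could define a materialized view that pre-aggregates daily event counts per source for use in dashboards. This is an illustrative sketch only; the logs_daily_by_source name is hypothetical, and it assumes the t and source columns of the nativedataset.logs table created in the next step:

CREATE MATERIALIZED VIEW nativedataset.logs_daily_by_source
-- Optionally, the view could also be partitioned to mirror the base table's partitioning.
CLUSTER BY source
AS
SELECT DATE(t) AS event_date, source, COUNT(*) AS event_count
FROM nativedataset.logs
GROUP BY event_date, source;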

The simplest way to do this is using a scheduled query that runs daily after replication completes.

To set up, create a native BigQuery table that is an initial copy of the external table. For example, for the logs table:

CREATE OR REPLACE TABLE nativedataset.logs
PARTITION BY DATE(t)
CLUSTER BY archive_date, org_id, source
AS
SELECT *
FROM externaldataset.logs
-- The external table was created with require_hive_partition_filter = true, so this copy
-- must filter on the partition columns; adjust the range to cover all data you want to copy.
WHERE part_org_id IS NOT NULL AND archive_date >= '1970-01-01';

Then create a scheduled query that runs daily to append any new data from the external table to the native table. Continuing the example:

INSERT INTO `nativedataset.logs`
SELECT *
FROM `externaldataset.logs`
WHERE archive_date > (
  -- Calculate the max archive_date from the last 7 days of partitions
  SELECT MAX(archive_date)
  FROM `nativedataset.logs`
  WHERE t >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);

This appends to the native table any new data that has been replicated since the last time the scheduled query ran. Set the scheduled query to run once per day after replication completes (e.g., 4 AM in the UTC time zone).