
AWS Athena Querying

Archived data from SparkLogs is stored in a Hive-partitioned Parquet data lake (details) and can be queried natively by AWS Athena once it is replicated to an AWS S3 bucket.

Before you begin, make sure your data is being replicated to an S3 bucket and that you have at least one day's worth of data in the bucket. Also, set the appropriate storage tier for objects in the bucket based on how often you plan to query the data.

Once you set up AWS Athena, you can also use Amazon SageMaker Unified Studio for data processing, ETL, machine learning, model training, inference, and more.

This guide will cover the process for setup, schema discovery, automatic partition discovery, and querying.

Automatically Create Table Schemas with AWS Glue Crawlers

Athena does not infer schemas on its own. Instead, it relies on the AWS Glue Data Catalog. The easiest way to populate the catalog is with an AWS Glue Crawler.

What a Glue Crawler Does

A Glue Crawler:

  • Scans S3 data
  • Detects:
    • Available tables
    • Column names and data types
    • Partition keys (from Hive-style key=value directories; see the example layout below)
  • Creates or updates tables in the Glue Data Catalog
  • Makes those tables immediately queryable from Athena
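
For reference, a Hive-style layout looks roughly like the sketch below. The partition key names (part_org_id, archive_date) are taken from the query examples later in this guide; your exact prefix, table names, and file names may differ:

s3://my-bucket/archive/logs/part_org_id=all/archive_date=2026-02-04/part-0000.parquet
s3://my-bucket/archive/logs/part_org_id=all/archive_date=2026-02-05/part-0000.parquet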

Setup AWS Glue Crawler

Your SparkLogs archived data is stored under the archive/ prefix in your bucket, and each table is stored under its own subdirectory. A single AWS Glue Crawler can be created to scan all tables:

  1. Open AWS Glue → Crawlers
  2. Create a new crawler
  3. Configure the crawler:
    • Data store: Amazon S3
    • Path: Path to the archive directory (parent of all the table directories), e.g., s3://my-bucket/archive/logs/
    • File format: Parquet
    • Partition detection: Enabled
    • Subsequent crawler runs: Crawl New Sub-folders Only
    • Schedule: We recommend scheduling the crawler to run once per day after replication completes (e.g., 4 AM UTC, cron expression 0 4 * * ? *)
    • Critical output configuration settings:
      • Select or create a target database (e.g., sparklogs-datalake)
      • Set a maximum table threshold to something reasonable (e.g., 32) in case other settings are misconfigured.
      • Under advanced options, uncheck Create a single schema for each S3 path
      • Set the table level to 2 plus the number of additional subdirectories above your archive location. This ensures the crawler correctly identifies which part of the directory tree contains data for different tables.
        • For example, if your path is s3://my-bucket/archive/, set this to 2
        • For example, if your path is s3://my-bucket/parentdir/archive/, set this to 3
        • For example, if your path is s3://my-bucket/parentdir/subdir1/subdir2/archive/, set this to 5
  4. Run the crawler and monitor its progress.

After it completes, you should see:

  • One Glue table for each data table in your archive, each with inferred columns
  • Partition keys such as part_org_id and archive_date
  • These tables available in Athena for querying

As an advanced alternative, you could create one AWS Glue Crawler for each table in the archive to get more granular control over configuration and scheduling per table. For very large data sets, using one crawler per table also allows the crawlers to run in parallel.

Query Data with Athena

Your tables should now be populated and ready to query in Athena.

You can use DDL statements to discover tables and their schemas:
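
For example (a sketch assuming the crawler created a table named logs in a database such as sparklogs-datalake, as in the queries below):

SHOW TABLES;
DESCRIBE logs;
SHOW PARTITIONS logs;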

Then use SQL to query the data. For example:

SELECT * FROM logs
WHERE LOWER(source) LIKE LOWER('%api-worker%')
AND LOWER(message) LIKE LOWER('%/request/make-payment%')
AND part_org_id='all' AND archive_date BETWEEN DATE '2026-01-01' AND DATE '2026-02-04';

When querying, always filter on the archive_date partition key to minimize scanned data and reduce costs.

All of the non-standard fields are stored in a JSON field. For example, for the logs table the JSON field is event_json. You can use the json_extract_scalar function to extract specific values from the JSON field. For example:

SELECT * FROM logs
WHERE trace_id = 'bf674a2d-b3a7-4936-8db8-f675dde9ab70'
AND json_extract_scalar(event_json, '$.span_id') = 'ae2a6389-56c8-446d-b596-fb3e3f262149'
AND part_org_id='all' AND archive_date = DATE '2026-02-04';

You may need to use the TRY_CAST function to convert an extracted JSON scalar value (which is always a string) to the appropriate SQL data type (e.g., BIGINT or DOUBLE).
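
For example (a sketch using a hypothetical numeric field $.duration_ms; substitute a field that actually exists in your event_json):

SELECT message,
  TRY_CAST(json_extract_scalar(event_json, '$.duration_ms') AS DOUBLE) AS duration_ms
FROM logs
WHERE TRY_CAST(json_extract_scalar(event_json, '$.duration_ms') AS DOUBLE) > 1000
AND part_org_id='all' AND archive_date = DATE '2026-02-04';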

Note that any timestamps stored in JSON fields are represented as microseconds from the UNIX epoch. To convert them back to timestamp values, multiply by 1000 (to get nanoseconds) and use the from_unixtime_nanos function. For example:

SELECT * FROM logs
WHERE from_unixtime_nanos(TRY_CAST(json_extract_scalar(event_json, '$.first_provisioned_date') AS DECIMAL(30, 10)) * (DECIMAL '1000')) < (TIMESTAMP '2025-06-30 00:00:00')
AND part_org_id='all' AND archive_date = DATE '2026-02-04';

Advanced: Partition Indexes and Partition Projection

If you have thousands of days of archived data and your queries are spending a lot of time in the "Planning" phase, you may be able to speed them up by enabling partition indexes or converting your external table definitions to use partition projection.
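
As an illustration only, the sketch below enables partition projection on the logs table with ALTER TABLE ... SET TBLPROPERTIES. It assumes the partition keys are part_org_id and archive_date (as in the queries above), that archive_date values use the yyyy-MM-dd format, and that part_org_id only takes the value all; verify with SHOW PARTITIONS logs and adjust the values and date range to match your data:

ALTER TABLE logs SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.part_org_id.type' = 'enum',
  'projection.part_org_id.values' = 'all',
  'projection.archive_date.type' = 'date',
  'projection.archive_date.format' = 'yyyy-MM-dd',
  'projection.archive_date.range' = '2020-01-01,NOW'
);

Once projection is enabled, Athena derives this table's partitions from these properties rather than from the Glue Data Catalog, so make sure the date range covers all of your archived data.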

Additional Resources