Archive Data Format
Directory Structure
Objects within the (replicated) archive will have the following structure:
archive/
  (tablename)/
    part_org_id=all/
      archive_date=YYYY-MM-DD/
        part-jYYYYMMDDhhmmssSSS-00000.parquet
        part-jYYYYMMDDhhmmssSSS-00001.parquet
        ...
audit_logs/
  (log_type)/
    (workspace_id)/
      job-YYYY-MM-DDThh:mm:ssZ.txt
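To make the layout concrete, the short Python sketch below splits an archive object key into its table, partition, and file-name components. The specific key, date, and helper name in the example are hypothetical and only illustrate the structure shown above.

```python
def parse_archive_key(key: str) -> dict:
    """Break an archive/ object key into its table, partition, and file parts."""
    table, org_part, date_part, filename = key.removeprefix("archive/").split("/")
    return {
        "table": table,
        "part_org_id": org_part.split("=", 1)[1],    # e.g. "all"
        "archive_date": date_part.split("=", 1)[1],  # e.g. "2024-01-15"
        "file": filename,
    }

# Hypothetical key following the layout above:
print(parse_archive_key(
    "archive/logs/part_org_id=all/archive_date=2024-01-15/part-j20240115010203000-00000.parquet"
))
# {'table': 'logs', 'part_org_id': 'all', 'archive_date': '2024-01-15',
#  'file': 'part-j20240115010203000-00000.parquet'}
```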
Data Lake Data Format
The prefix archive/ contains the archived observability Data Lake in Apache Hive-partitioned parquet format.
Hive partitioning means that each table's parquet files are organized into subdirectories based on partition keys.
In this case, the partition keys are part_org_id (organization ID used for partitioning) and archive_date. At this time,
data for all organizations is stored together under the part_org_id=all/ partition. Each archive_date partition contains
the parquet files for data archived on that date (generally, the data that was ingested on that date in UTC).
Hive partitioning is supported by large-scale query engines, which use it to reduce the amount of data that must be scanned by queries that filter on the partition keys. This allows you to query specific date ranges of data more efficiently. Additionally, parquet is a columnar storage format that supports efficient compression and encoding, and limits scanning to only the columns referenced in a query, further reducing the amount of data read.
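As an illustration of partition pruning, the sketch below uses DuckDB from Python to count one day's events. The choice of the logs table, the date, and the assumption that the archive/ prefix is accessible locally (or via a path DuckDB can read) are placeholders rather than part of this format; any Hive-aware engine can run an equivalent query.

```python
import duckdb

# A minimal sketch, assuming DuckDB and a locally accessible copy of the
# archive/ prefix; the table name and date below are placeholders.
con = duckdb.connect()
events = con.sql("""
    SELECT count(*) AS events
    FROM read_parquet(
        'archive/logs/*/*/*.parquet',   -- part_org_id=*/archive_date=*/file levels
        hive_partitioning = true        -- exposes part_org_id and archive_date as columns
    )
    WHERE archive_date = '2024-01-15'   -- partition pruning: only that day's files are scanned
""").fetchall()
print(events)
```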
Tables in Data Lake
The following tables are included in the archived Data Lake:
- logs: One row per event ingested.
- log_batches: One row per ingested log batch. Has event and byte count, unique field names, and a mapping between ingested times and event times.
- logs_json_fields: Contents of a materialized view summarizing the unique field names present per organization per hour.
- logs_sources: Statistics for each event source per organization per hour.
If Game Engine Analytics is enabled, the following additional tables are included:
- g_analytics: One row per analytics event. Flattens many of the common analytics fields to distinct columns for convenience.
- g_analytics_event_ids: Contents of a materialized view summarizing the unique event IDs present per organization per hour.
- g_analytics_snapshot_users: Populated daily with a summary of active daily users for each given day.
- g_analytics_snapshot_events: Populated daily with a summary of event statistics per event ID per user for each given day.
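The column sets of these tables are not enumerated here, but they can be read directly from the parquet files themselves. The sketch below is a minimal example, assuming PyArrow and a local copy of the archive/ prefix; add the g_analytics* table names to the list if Game Engine Analytics is enabled for your archive.

```python
import pyarrow.dataset as ds

# Discover each table's columns from the archived parquet files.
for table_name in ["logs", "log_batches", "logs_json_fields", "logs_sources"]:
    dataset = ds.dataset(
        f"archive/{table_name}",  # the table's prefix from the layout above
        format="parquet",
        partitioning="hive",      # exposes part_org_id and archive_date as columns
    )
    print(f"--- {table_name} ---")
    print(dataset.schema)
```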
Audit Logs Format
The prefix audit_logs/ contains a log file for each archival and replication job that runs daily. Each job log is a text file
containing detailed information about the archival or replication job, including its overall status. Note that because the
replication audit log is generated only after replication completes, your replicated storage will contain audit logs for all
archival jobs, but replication audit logs will be one day behind.
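If you want to work with these job logs programmatically, the timestamp in each file name can be parsed directly. The sketch below is a minimal example, assuming the audit_logs/ prefix has been synced to a local directory; it only reads the file names shown in the layout above and does not interpret the log contents.

```python
from datetime import datetime
from pathlib import Path

# Find the most recent job log per (log_type, workspace_id) pair by parsing
# the timestamp embedded in the file name (job-YYYY-MM-DDThh:mm:ssZ.txt).
latest: dict[tuple[str, str], tuple[datetime, Path]] = {}

for path in Path("audit_logs").glob("*/*/job-*.txt"):
    log_type, workspace_id = path.parts[-3], path.parts[-2]
    ts = datetime.strptime(path.stem, "job-%Y-%m-%dT%H:%M:%S%z")
    key = (log_type, workspace_id)
    if key not in latest or ts > latest[key][0]:
        latest[key] = (ts, path)

for (log_type, workspace_id), (ts, path) in sorted(latest.items()):
    print(f"{log_type}/{workspace_id}: latest job at {ts.isoformat()} -> {path}")
```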