OTel SDKs In Production

Overview

The per-language pages (Python, Node.js, Java, .NET, Go, Ruby, PHP, Rust) get you a working pipeline in five minutes with sensible defaults. This page covers what changes once you're past the "hello, world" stage: how to choose between deployment topologies, how to tune the SDK's batching processor, how to detect and mitigate backpressure, and how to avoid the most common production gotcha — losing in-flight logs at process shutdown.

The recommendations here apply to every OTel SDK language. We point at language-specific docs when there's a divergence.

Topologies — direct, collector, file + agent

There are three production topologies for getting logs from your application into SparkLogs. Most teams start with direct and graduate to collector when one of the triggers below applies. File + agent is a third path used in specific cases.

1. OTel SDK direct → SparkLogs

Your application's OTel SDK posts OTLP/HTTP batches straight to https://ingest-<REGION>.engine.sparklogs.app/v1/logs. No extra process, no extra config surface.

  • Pros: simplest setup. Only your app's process needs OTel libraries. One credential, one network hop. Lowest operational burden.
  • Cons: each app process holds its own batch buffer (not shared across processes on the same node). If SparkLogs is unreachable for longer than your queue can absorb, the SDK drops events. No central place to redact, sample, or fan out to a second backend.
  • Use when: you're getting started; one or a few app processes per node; you don't need queue-on-outage durability beyond what fits in process memory.
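
As a reference point, direct export needs nothing beyond the standard OTLP environment variables. A minimal sketch — the endpoint is the one above; the exact Authorization value comes from Configure → Agents, so the header shown is a placeholder:

export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="https://ingest-<REGION>.engine.sparklogs.app/v1/logs"
export OTEL_EXPORTER_OTLP_LOGS_PROTOCOL="http/protobuf"
# Placeholder — copy the real value from Configure → Agents:
export OTEL_EXPORTER_OTLP_LOGS_HEADERS="Authorization=<YOUR_CREDENTIAL>"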

2. OTel SDK → local collector → SparkLogs

Your app's OTel SDK exports to a local collector — for example the OpenTelemetry Collector, Grafana Alloy, or Vector — running as a sidecar, DaemonSet, or per-host service. The collector aggregates from one or many app processes, optionally processes the data, and forwards to SparkLogs.

  • Pros: shared queue across all apps on the node (one buffer, one outbound connection); central place for redaction, sampling, attribute enrichment, fan-out to multiple backends, and credential rotation; can spool to disk via file_storage for queue-on-outage durability; consistent config across many languages / services.
  • Cons: one more process to deploy and monitor; one more place to misconfigure; small added latency.
  • Use when: many services share a node; you need multi-backend fan-out, central config / secret management, or queue-on-outage durability beyond what process memory can hold; you want sampling or PII redaction in one place; or you're on a language whose OTel logs SDK is still maturing.
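
As a sketch of what this looks like in practice, here is a minimal OpenTelemetry Collector config that receives OTLP from local apps, spools the export queue to disk via file_storage, and forwards to SparkLogs. The credential and directory path are placeholders; adapt to your deployment:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4318   # local apps export here

extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # must exist and be writable

processors:
  batch: {}

exporters:
  otlphttp:
    logs_endpoint: https://ingest-<REGION>.engine.sparklogs.app/v1/logs
    headers:
      Authorization: <YOUR_CREDENTIAL>   # placeholder
    sending_queue:
      storage: file_storage   # queue survives restarts and backend outages

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]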

3. Log to file (or stdout) + log-forwarding agent

Your application logs to a local file or stdout. A separate agent — Vector, Fluent Bit, the OpenTelemetry Collector's filelog receiver, Filebeat, etc. — tails the file and ships the events.

  • Pros: app process has zero observability dependencies; works for languages without a stable OTel logs SDK; works for legacy code you can't change; survives app crashes (the file is durable on disk); container runtimes that already capture stdout require no app-side changes.
  • Cons: text → structured-event parsing happens in the agent (more complex than emitting structured OTel records); harder to correlate with metrics / traces emitted by the same process; structured fields and trace context need extra work.
  • Use when: very high throughput per process; languages without a stable OTel logs SDK; air-gapped or container-runtime-only environments. See the operating-systems page.
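
For the Collector-based variant of this topology, a sketch of the filelog receiver tailing a structured log file and shipping it on — the file path and one-JSON-object-per-line format are assumptions:

receivers:
  filelog:
    include: [/var/log/myapp/*.log]
    operators:
      - type: json_parser   # assumes the app writes one JSON object per line

exporters:
  otlphttp:
    logs_endpoint: https://ingest-<REGION>.engine.sparklogs.app/v1/logs
    headers:
      Authorization: <YOUR_CREDENTIAL>   # placeholder

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlphttp]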

Quick decision table

Need | Pick
Get started in 5 minutes | Direct
Multi-service node, single outbound connection | Collector
Queue on backend outage (disk-spooled) | Collector
Sampling / redaction across services | Collector
Multi-backend fan-out | Collector
Language without a stable OTel logs SDK | File + agent
Container runtime already captures stdout | File + agent
Tightest CPU / memory footprint in app | File + agent

Batching with BatchLogRecordProcessor

Every production OTel SDK setup must use BatchLogRecordProcessor (or the language's idiomatic equivalent) — not SimpleLogRecordProcessor, which exports one log record per HTTP request and is only acceptable in tests.

The batch processor has four knobs, all configurable via OTel-spec environment variables:

Variable | Default | What it controls
OTEL_BLRP_MAX_QUEUE_SIZE | 2048 records | Maximum records buffered. Records exceeding this are dropped.
OTEL_BLRP_MAX_EXPORT_BATCH_SIZE | 512 records | Maximum records per HTTP request. Must be ≤ queue size.
OTEL_BLRP_SCHEDULE_DELAY | 1000 ms | Time the processor waits between exports when the queue is below the batch threshold.
OTEL_BLRP_EXPORT_TIMEOUT | 30000 ms | Per-export timeout. The processor abandons an in-flight export after this.

The defaults are production-appropriate for most workloads. They prevent per-event sends, respect SparkLogs's per-request size limits, and keep memory pressure low. Don't tune until something tells you to.
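
If you configure the processor in code rather than env vars, the same knobs exist on the SDK builders. A Java sketch mirroring the defaults above (the exporter wiring is illustrative):

import java.time.Duration;
import io.opentelemetry.exporter.otlp.http.logs.OtlpHttpLogRecordExporter;
import io.opentelemetry.sdk.logs.SdkLoggerProvider;
import io.opentelemetry.sdk.logs.export.BatchLogRecordProcessor;

SdkLoggerProvider provider =
    SdkLoggerProvider.builder()
        .addLogRecordProcessor(
            BatchLogRecordProcessor.builder(OtlpHttpLogRecordExporter.builder().build())
                .setMaxQueueSize(2048)                         // OTEL_BLRP_MAX_QUEUE_SIZE
                .setMaxExportBatchSize(512)                    // OTEL_BLRP_MAX_EXPORT_BATCH_SIZE
                .setScheduleDelay(Duration.ofMillis(1000))     // OTEL_BLRP_SCHEDULE_DELAY
                .setExporterTimeout(Duration.ofMillis(30000))  // OTEL_BLRP_EXPORT_TIMEOUT
                .build())
        .build();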

When to tune

Two signals tell you to bump the batch processor:

  1. Queue-full warnings in your SDK's internal log — records are being dropped. Bump MAX_QUEUE_SIZE.
  2. High p95 export latency — exports are taking longer than EXPORT_TIMEOUT and being abandoned. Bump EXPORT_TIMEOUT and check network health.

Tuning recipes

Bursty workload, occasional spikes — keep batch size at the default, raise the queue to absorb spikes:

export OTEL_BLRP_MAX_QUEUE_SIZE=8192
export OTEL_BLRP_MAX_EXPORT_BATCH_SIZE=512
export OTEL_BLRP_SCHEDULE_DELAY=1000

Steady-state high throughput — larger batches, longer timeout, queue at ~4× batch size:

export OTEL_BLRP_MAX_QUEUE_SIZE=8192
export OTEL_BLRP_MAX_EXPORT_BATCH_SIZE=2048
export OTEL_BLRP_SCHEDULE_DELAY=1000
export OTEL_BLRP_EXPORT_TIMEOUT=60000

Going past MAX_EXPORT_BATCH_SIZE=2048 rarely helps for logs — at that point you're better off adding a collector, which can apply its own larger batch on top of what your SDK sends.

Don't lower the schedule delay

OTEL_BLRP_SCHEDULE_DELAY defaults to 1 second, which means burst events typically wait at most a second before being exported. Going lower (e.g. 100 ms) doesn't reduce latency to SparkLogs — it just produces smaller batches and more HTTP overhead.

Compression

OTEL_EXPORTER_OTLP_LOGS_COMPRESSION controls the compression the OTel SDK applies to outgoing OTLP/HTTP batches. Common supported values are gzip and none. Python also supports deflate (faster but less compression). The Rust SDK supports zstd (the ideal algorithm for logs) via the zstd-tonic and zstd-http feature flags.

Compression is recommended but not required. SparkLogs does not bill for inbound bytes, so disabling is a valid choice for CPU-constrained workloads — you're trading network bandwidth for CPU on your side. To disable from the SDK:

export OTEL_EXPORTER_OTLP_LOGS_COMPRESSION=none

If you're on a metered network or behind a low-bandwidth VPN, use gzip (or zstd where the Rust SDK supports it).
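
To enable gzip from the SDK:

export OTEL_EXPORTER_OTLP_LOGS_COMPRESSION=gzip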

SparkLogs decompresses transparently on the server side and accepts a wider list of algorithms — including everything required by the OTLP specification — when sent as a valid Content-Encoding header by collectors or custom HTTP shippers. See HTTPS+JSON → Payload Compression for the full list.

Body-size cap and how it relates to batching

Each OTLP/HTTP request to SparkLogs must be ≤ 50 MiB compressed and ≤ 200 MiB after decompression. The batch processor's MAX_EXPORT_BATCH_SIZE controls how many records go into one request.

Rough math at default settings: 512 records × 50 KiB / event ≈ 25 MiB pre-compression; under gzip (typical 5–10× ratio for log data), that's a 3–5 MiB request body. You have ~10× headroom on the compressed cap before you need to think about this.

If you're emitting large bodies (multi-line stack traces, payloads), keep MAX_EXPORT_BATCH_SIZE at the default and rely on the processor's automatic per-request flushing.

See OTLP/HTTP API → Limits for the full body-size rules and trimming behavior.

Backpressure and dropped logs

When the queue fills, the SDK drops records rather than growing the buffer (most SDKs drop the newly arriving records) and emits an internal warning. This is by design — the SDK refuses to grow without bound and refuses to block your application's logging call.

How to detect it

  • SDK internal logging: every OTel SDK emits warnings like BatchLogRecordProcessor: queue full, dropping records at WARN level on its diagnostic logger. Make sure that logger is wired to your application's stdout / stderr / monitoring — see the Java sketch after this list.
  • OTel internal metrics: if you have OTel metrics enabled, the SDK exposes counters like otelsdk.processor.logs.queue.size and otelsdk.processor.logs.exported.records (exact names vary by language). Alert on queue size approaching MAX_QUEUE_SIZE.
  • Reverse-engineer via SparkLogs — if the count of expected events differs from what shows up in SparkLogs, you're either dropping at the SDK or the export is failing. Check our monitoring ingestion dashboard for the latter.
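
In Java, for example, the SDK routes its diagnostics through java.util.logging under the io.opentelemetry namespace — a sketch that surfaces batch-processor warnings on stderr (logger name and levels here are illustrative; other languages have their own diagnostic wiring):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// java.util.logging holds loggers weakly — keep a strong reference around.
Logger otelDiag = Logger.getLogger("io.opentelemetry");
otelDiag.setLevel(Level.WARNING);
ConsoleHandler handler = new ConsoleHandler(); // ConsoleHandler writes to System.err
handler.setLevel(Level.WARNING);
otelDiag.addHandler(handler);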

How to mitigate it

In escalating order of cost:

  1. Bump OTEL_BLRP_MAX_QUEUE_SIZE — spends memory in your app process to absorb larger spikes. Start with 8192; 16384 is reasonable for high-throughput services.
  2. Increase OTEL_EXPORTER_OTLP_LOGS_TIMEOUT — if drops are caused by exports being abandoned, give them more time.
  3. Add a local collector — moves the queue out of your app process. The OpenTelemetry Collector's file_storage extension, for example, can additionally spool to disk for durability across restarts and backend outages.
  4. Switch to file + agent — if your throughput is too high for any in-process queue and you have a file system available, write to a structured log file and let an agent ship it.

Graceful shutdown

The single most common "I logged something but it didn't show up" cause for short-lived processes (CLIs, scripts, AWS Lambda, Cloud Run jobs, batch jobs, tests) is the process exiting before the batch processor flushes its queue.

Always call the SDK's shutdown method before your process exits. It synchronously drains the queue and waits for in-flight exports to complete (up to the configured timeout). Per-language one-liners:

Language | Shutdown call
Python | logger_provider.shutdown()
Node.js | await loggerProvider.shutdown()
Java | loggerProvider.close() (try-with-resources or Runtime.addShutdownHook)
.NET | loggerFactory.Dispose() (or rely on host integration)
Go | loggerProvider.Shutdown(ctx) (defer)
Ruby | OpenTelemetry.logger_provider.shutdown
PHP | $loggerProvider->shutdown()
Rust | provider.shutdown() (or drop)

For long-running services, also wire shutdown to your termination signal handler so SIGTERM-driven shutdowns flush before the process dies. Container orchestrators typically give 10–30 seconds before SIGKILL — well within the 30-second SDK export timeout.
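
In Java, for example, a JVM shutdown hook covers both normal exit and SIGTERM — a sketch assuming the loggerProvider from your SDK setup:

// Flush buffered log records before the JVM exits. close() blocks until
// the queue drains or the shutdown timeout elapses.
Runtime.getRuntime().addShutdownHook(new Thread(loggerProvider::close));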

When to add a collector

Add a local collector — typically the OpenTelemetry Collector, but Grafana Alloy and Vector are also OTLP-compatible — when any of these apply:

  • More than ~5 application services on the same node — one collector with a shared queue uses less total memory and produces fewer outbound connections than each service maintaining its own.
  • You need queue-on-outage durability beyond what fits in process memory — for example, the OpenTelemetry Collector's file_storage extension persists to disk.
  • You want to redact PII, drop noisy log levels, or sample logs before they leave your network — easier to do once in a collector than N times in N services.
  • You're shipping to multiple observability backends — fan out from one collector instead of duplicating SDK setup.
  • You're working with a language where the OTel logs SDK is still maturing (currently Ruby, PHP) — emit logs to a collector via OTLP and let the collector handle the wire-protocol details.
  • You want a single place to rotate SparkLogs credentials — the collector holds the credential, app processes know nothing about it.
  • You're running zero-code auto-instrumentation that emits to localhost:4318 by default and don't want to override its endpoint per service.

If none of these apply, stay direct. Adding a collector "just in case" is operational overhead with no benefit.

Note about metrics + traces

Today SparkLogs receives logs only via OTLP — traces, metrics, and profiles are on the roadmap. When those signals come online, the same SDK process will be exporting three streams, and a few things will change:

  • Total queue memory triples if you set per-signal OTEL_*_MAX_QUEUE_SIZE symmetrically. Re-baseline once you turn on metrics and traces.
  • Per-signal endpoints: while logs is the only supported signal, point each signal's OTEL_EXPORTER_OTLP_<SIGNAL>_ENDPOINT independently. After traces and metrics ship, you can use the umbrella OTEL_EXPORTER_OTLP_ENDPOINT for all three.
  • Traces drive sampling decisions for logs when log-trace correlation is set up — high trace sampling rates can multiply log volume.

Failure modes and what to expect

Export timeout (OTEL_EXPORTER_OTLP_LOGS_TIMEOUT)

The exporter abandons an HTTP POST after this many milliseconds. The OTel-spec default is 10 seconds; our examples set it to 25000 ms to tolerate cold-start latency on serverless platforms (Cloud Run, Lambda, Cloud Functions). You can lower it to 10 s if you're not on a serverless platform and want faster failure detection; you can raise it to 60 s if you're on a slow link and seeing timeouts.
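
To set it explicitly:

# Our serverless-friendly example value:
export OTEL_EXPORTER_OTLP_LOGS_TIMEOUT=25000
# Faster failure detection on a reliable, non-serverless network:
# export OTEL_EXPORTER_OTLP_LOGS_TIMEOUT=10000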

Retries

Most OTel SDKs retry transient failures (HTTP 429, 503, 504, network errors) with exponential backoff. The retry budget is bounded by OTEL_BLRP_EXPORT_TIMEOUT — once that elapses, the records are dropped. We return retryable status codes at the OTLP layer; your SDK respects them automatically.

Authentication failures (HTTP 401 / 403)

These are not retryable. The SDK logs the failure and drops the batch. Verify your OTEL_EXPORTER_OTLP_LOGS_HEADERS value, and confirm the agent ID / access token in Configure → Agents in the SparkLogs app.
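
The header is set via the standard env var; the exact credential format comes from Configure → Agents, so the value below is a placeholder:

export OTEL_EXPORTER_OTLP_LOGS_HEADERS="Authorization=<VALUE_FROM_CONFIGURE_AGENTS>"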

Body-size rejections (HTTP 413)

A single batch exceeded our 50 MiB compressed cap. Lower OTEL_BLRP_MAX_EXPORT_BATCH_SIZE. See OTLP/HTTP API → Limits.

Decode failures (HTTP 400)

The OTLP payload was malformed. Check your SDK version (current OTLP-spec compliant SDK versions don't produce this) and the Content-Type header (application/json or application/x-protobuf).

Clock drift

For most production workloads, time-sync the host clock (NTP, chrony, or your platform's equivalent) and you can skip this entire section. On public clouds time should already be synced by default. The clock-drift compensation header below is for environments where NTP isn't viable — IoT, embedded firmware, air-gapped systems, or desktop systems outside of your full control.

If your client clock genuinely can drift by more than 120 seconds from real time and you can't fix it at the OS level, SparkLogs can compensate automatically: send the X-Client-Clock-Utc-Now header (an int64 of UTC seconds since the UNIX epoch, from your client's perspective) on each exported batch of logs, and the server will compare it against its own clock and adjust every timestamp in that batch by the difference.

The catch: the value must be the client's current clock at the moment the HTTP request is submitted, not a value snapshotted at SDK startup. OTEL_EXPORTER_OTLP_LOGS_HEADERS is read once at startup, so it cannot be used for this header. You need a per-export hook.

The supplier / handler needs to run once per HTTP request (per batch), not once per log record, so there is no per-record overhead. Some OTel SDKs offer this hook idiomatically; others don't, and there you have to craft something more custom.

Idiomatic per-export hook (Java, Go, .NET)

OtlpHttpLogRecordExporterBuilder.setHeaders(Supplier<Map<String,String>>) is invoked once per export. It merges with static headers from addHeader(...) and OTEL_EXPORTER_OTLP_LOGS_HEADERS, so leave Authorization in the env var and supply only the dynamic header here:

import java.time.Instant;
import java.util.Map;
import io.opentelemetry.exporter.otlp.http.logs.OtlpHttpLogRecordExporter;

OtlpHttpLogRecordExporter exporter =
    OtlpHttpLogRecordExporter.builder()
        // Evaluated once per export, so each batch carries a fresh timestamp.
        .setHeaders(() -> Map.of(
            "X-Client-Clock-Utc-Now",
            String.valueOf(Instant.now().getEpochSecond())))
        .build();

No clean SDK hook today (Python, Node.js, Rust, Ruby, PHP)

These SDKs accept only static headers on the OTLP exporter. Two realistic options:

  • Subclass or wrap the exporter to inject the header at HTTP send time. Doable, but reaches into per-language private API surface that isn't covered by the SDK's stability guarantees, and tends to break across SDK upgrades. Treat this as a workaround, not a long-term integration.
  • Bypass the OTel SDK for logs and ship via our OTLP/HTTP or HTTPS+JSON API directly, using a small handcrafted client where setting per-request headers is possible. See our Unreal Engine plugin for example code, or the sketch below.
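
A minimal Java sketch of that second option — a handcrafted OTLP/JSON POST that computes X-Client-Clock-Utc-Now at send time. The endpoint region and credential are placeholders, and the payload is the bare minimum OTLP/JSON shape:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

public class ClockHeaderShipper {
  public static void main(String[] args) throws Exception {
    // One log record in minimal OTLP/JSON form (OTLP encodes 64-bit ints as JSON strings).
    long nowNanos = Instant.now().toEpochMilli() * 1_000_000L;
    String payload = """
        {"resourceLogs":[{"scopeLogs":[{"logRecords":[
          {"timeUnixNano":"%d","severityText":"INFO","body":{"stringValue":"hello"}}
        ]}]}]}""".formatted(nowNanos);

    HttpRequest request = HttpRequest.newBuilder(
            URI.create("https://ingest-<REGION>.engine.sparklogs.app/v1/logs")) // placeholder region
        .header("Content-Type", "application/json")
        .header("Authorization", "<VALUE_FROM_CONFIGURE_AGENTS>") // placeholder credential
        // Computed at send time, once per request — exactly what static
        // env-var headers cannot do.
        .header("X-Client-Clock-Utc-Now",
            String.valueOf(Instant.now().getEpochSecond()))
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode()); // expect 2xx on success
  }
}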

The OTel Collector's headers_setter extension is not a workable escape hatch — its supported sources (static value, file, incoming HTTP context, auth attributes) do not include "the current time," and adding that capability requires writing a custom Go extension and building a custom Collector binary via ocb.
