Observability Data Lake — Integration Reference

This document provides the complete data source inventory, configuration snippets, telemetry mapping, and query examples for the b4arena observability data lake. It serves as the technical reference for implementors deploying the architecture described in the ADR.


Architecture Overview

┌─── MIMAS ──────────────────────────────────────────┐
│                                                    │
│  Claude Code agents ──┐                            │
│  Codex CLI agents ────┤── OTLP/gRPC ──┐            │
│                       │               │            │
│  OTel Collector ◄─────┘               │            │
│    ├─ otlp receiver ◄─────────────────┘            │
│    ├─ hostmetricsreceiver                          │
│    ├─ podmanreceiver                               │
│    ├─ memory_limiter processor                     │
│    ├─ batch processor                              │
│    └─ otlphttp exporter ─────────────────────►     │
└────────────────────────────────────────────┬───────┘
                                             │
                                  OTLP/HTTP  │
                                             ▼
                    ┌─── GREPTIMEDB HOST ────────────┐
                    │  :4000 HTTP / PromQL / OTLP    │
                    │  :4001 gRPC                    │
                    │  :4002 SQL (MySQL-compat)      │
                    │                                │
                    │  Grafana                       │
                    │    └─ GreptimeDB plugin        │
                    │       (SQL + PromQL sources)   │
                    └────────────────────────────────┘

Key topology: OTel Collector runs on Mimas. GreptimeDB runs on a separate dedicated machine. All agent telemetry flows through the Collector, which exports to GreptimeDB via OTLP/HTTP.


Data Source: Claude Code Native OTel

Configuration

Set these environment variables in the agent's runtime environment:

CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=claude-code-agent-<name>
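One convenient pattern (a sketch — the file path and agent name are illustrative, not part of the reference configuration) is to keep these settings in a single env file and source it before launching the agent:

```shell
# Illustrative path; pick a location that suits your deployment.
cat > /tmp/claude-otel.env <<'EOF'
CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=claude-code-agent-dev
EOF

# Export every variable in the file into the current shell,
# then launch the agent from this shell.
set -a
. /tmp/claude-otel.env
set +a
echo "$OTEL_SERVICE_NAME"
```

The same file can be reused as a systemd `EnvironmentFile=` if the agent runs as a service.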

Cardinality Controls

# session_id as a metric dimension is left disabled here, since
# enabling it significantly increases metric cardinality
OTEL_METRICS_INCLUDE_SESSION_ID=false

# Truncate long attribute values to bound cardinality and payload size
OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT=256

Emitted Metrics

Metric Name                | Type    | Unit     | Key Dimensions
---------------------------|---------|----------|-----------------------------------------------------------
claude_code_token_usage    | Counter | tokens   | model, type (input/output/cache_read/cache_creation), service_name
claude_code_cost_usage     | Counter | USD      | model, service_name
claude_code_lines_of_code  | Gauge   | lines    | file_type, change_type (added/removed), service_name
claude_code_session_count  | Counter | sessions | service_name
claude_code_tool_use_count | Counter | calls    | tool_name, service_name

Emitted Events (Log Records)

Event Name               | Key Attributes
-------------------------|------------------------------------------
gen_ai.user.message      | gen_ai.conversation.id, content (opt-in)
gen_ai.assistant.message | gen_ai.conversation.id, content (opt-in)
gen_ai.tool.message      | gen_ai.tool.name, gen_ai.conversation.id

Trace Spans

Claude Code emits spans with the following structure:

  • Root span: claude_code_session (session lifecycle)
  • Child spans: claude_code_conversation_turn (per turn)
  • Leaf spans: claude_code_tool_use (per tool invocation)

Resulting GreptimeDB Tables

Tables are auto-created from OTLP ingestion:

  • claude_code_token_usage — token counter metric
  • claude_code_cost_usage — cost counter metric
  • claude_code_lines_of_code — gauge metric
  • claude_code_session_count — session counter
  • claude_code_tool_use_count — tool call counter
  • opentelemetry_traces — all trace spans
  • opentelemetry_logs — all log records (events)

Example Queries

SQL — Total tokens by model in last 24 hours:

SELECT
  model,
  type,
  sum(value) AS total_tokens
FROM claude_code_token_usage
WHERE greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY model, type
ORDER BY total_tokens DESC;

PromQL — Token rate per agent (5-minute window):

sum by (service_name) (rate(claude_code_token_usage[5m]))

SQL — Cost per agent per day:

SELECT
  service_name,
  date_trunc('day', greptime_timestamp) AS day,
  sum(value) AS total_cost_usd
FROM claude_code_cost_usage
WHERE greptime_timestamp > now() - '7d'::INTERVAL
GROUP BY service_name, day
ORDER BY day DESC, total_cost_usd DESC;
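The session/turn/tool span hierarchy described above can also be reconstructed from the traces table with a self-join on parent_span_id. A sketch, using the opentelemetry_traces columns documented later in this reference; '<trace-id>' is a placeholder:

SQL — Span tree for one session:

```sql
SELECT
  child.span_name,
  parent.span_name AS parent_span,
  child.duration_nano / 1000000 AS duration_ms
FROM opentelemetry_traces child
LEFT JOIN opentelemetry_traces parent
  ON parent.trace_id = child.trace_id
  AND parent.span_id = child.parent_span_id
WHERE child.trace_id = '<trace-id>'
ORDER BY child.greptime_timestamp;
```

Root spans (claude_code_session) come back with a NULL parent_span, since their parent_span_id is empty.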

Data Source: Codex CLI Native OTel

Configuration

Add to ~/.codex/config.toml:

[otel]
enabled = true
endpoint = "http://localhost:4318"
protocol = "http/protobuf"
service_name = "codex-agent-<name>"

Emitted Telemetry

Signal | Metric/Span Name         | Key Attributes
-------|--------------------------|-----------------------------------------------------------------------------
Traces | codex_api_request        | gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
Traces | codex_tool_execution     | gen_ai.tool.name, duration
Logs   | gen_ai.user.message      | gen_ai.conversation.id, content (opt-in)
Logs   | gen_ai.assistant.message | gen_ai.conversation.id, content (opt-in)

Known Limitations

  • codex exec mode supports traces and logs but does not emit metrics.
  • codex mcp-server mode has no OTel support — telemetry is not available in MCP server mode.

Resulting GreptimeDB Tables

  • opentelemetry_traces — trace spans (shared table with Claude Code spans)
  • opentelemetry_logs — log records (shared table)

Example Queries

SQL — Codex tool calls in last hour:

SELECT
  span_name,
  json_get_string(span_attributes, '$.gen_ai.tool.name') AS tool_name,
  duration_nano / 1000000 AS duration_ms
FROM opentelemetry_traces
WHERE service_name = 'codex-agent-dev'
  AND span_name = 'codex_tool_execution'
  AND greptime_timestamp > now() - '1h'::INTERVAL
ORDER BY greptime_timestamp DESC;
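Because codex exec emits no metrics, token counts have to be derived from the gen_ai.usage.* span attributes instead. A sketch — it assumes the attributes land in the span_attributes JSON column as described later, and uses GreptimeDB's json_get_int extraction function:

SQL — Codex token totals per agent from spans (last 24 hours):

```sql
SELECT
  service_name,
  sum(json_get_int(span_attributes, '$.gen_ai.usage.input_tokens')) AS input_tokens,
  sum(json_get_int(span_attributes, '$.gen_ai.usage.output_tokens')) AS output_tokens
FROM opentelemetry_traces
WHERE span_name = 'codex_api_request'
  AND greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY service_name;
```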

Data Source: Host System Metrics (hostmetricsreceiver)

OTel Collector Configuration

receivers:
  hostmetrics:
    collection_interval: 15s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
      network:
      load:
      filesystem:
        exclude_mount_points:
          mount_points:
            - /proc/*
            - /sys/*
            - /dev/*
          match_type: regexp
      process:
        include:
          names:
            - claude
            - codex
            - otelcol-contrib
            - podman
          match_type: regexp
        mute_process_all_errors: true

Resulting Metrics and GreptimeDB Tables

Metric Name                   | GreptimeDB Table               | Type    | Unit
------------------------------|--------------------------------|---------|------------
system.cpu.utilization        | system_cpu_utilization         | Gauge   | ratio (0-1)
system.memory.usage           | system_memory_usage            | Gauge   | bytes
system.memory.utilization     | system_memory_utilization      | Gauge   | ratio (0-1)
system.disk.io                | system_disk_io                 | Counter | bytes
system.disk.operations        | system_disk_operations         | Counter | operations
system.network.io             | system_network_io              | Counter | bytes
system.network.connections    | system_network_connections     | Gauge   | connections
system.cpu.load_average.1m    | system_cpu_load_average_1m     | Gauge   | load
system.filesystem.usage       | system_filesystem_usage        | Gauge   | bytes
system.filesystem.utilization | system_filesystem_utilization  | Gauge   | ratio (0-1)
process.cpu.utilization       | process_cpu_utilization        | Gauge   | ratio (0-1)
process.memory.physical_usage | process_memory_physical_usage  | Gauge   | bytes

Example Queries

SQL — CPU utilization per core (5-minute RANGE aggregation, GreptimeDB syntax):

SELECT
  cpu,
  avg(value) RANGE '5m' AS avg_utilization
FROM system_cpu_utilization
ALIGN '5m' BY (cpu) FILL PREV;

The RANGE ... ALIGN ... FILL syntax is GreptimeDB-specific time-series aggregation. RANGE defines the window, ALIGN the bucket interval, BY the grouping key, and FILL the gap-filling strategy (PREV, LINEAR, NULL, or a constant).

SQL — Memory utilization over time (1-minute intervals, RANGE syntax):

SELECT
  avg(value) RANGE '1m' AS avg_memory_util
FROM system_memory_utilization
ALIGN '1m' FILL LINEAR;

PromQL — System load average:

system_cpu_load_average_1m

SQL — Per-process CPU usage for agent processes:

SELECT
  process_command,
  avg(value) AS avg_cpu
FROM process_cpu_utilization
WHERE greptime_timestamp > now() - '30m'::INTERVAL
GROUP BY process_command
ORDER BY avg_cpu DESC;

Data Source: Podman Container Metrics (podmanreceiver)

OTel Collector Configuration

receivers:
  podman_stats:
    endpoint: unix:///run/user/1000/podman/podman.sock
    collection_interval: 15s

The socket path uses the UID of the rootless Podman user. Replace 1000 with the actual UID (id -u).

Prerequisite: Enable the Podman socket for the rootless user:

systemctl --user enable --now podman.socket
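Before pointing the receiver at the socket, it is worth confirming the API actually answers. A quick check (a sketch; it uses the libpod ping endpoint and assumes curl is available, with $XDG_RUNTIME_DIR standing in for /run/user/<uid>):

```shell
# Prints "OK" if the rootless Podman API socket is up.
curl --unix-socket "$XDG_RUNTIME_DIR/podman/podman.sock" http://d/_ping
```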

Collected Metrics

Metric Name                                        | GreptimeDB Table                                   | Type    | Unit
---------------------------------------------------|----------------------------------------------------|---------|------------
container.cpu.usage.total                          | container_cpu_usage_total                          | Counter | nanoseconds
container.cpu.percent                              | container_cpu_percent                              | Gauge   | percent
container.memory.usage                             | container_memory_usage                             | Gauge   | bytes
container.memory.percent                           | container_memory_percent                           | Gauge   | percent
container.blockio.io_service_bytes_recursive.read  | container_blockio_io_service_bytes_recursive_read  | Counter | bytes
container.blockio.io_service_bytes_recursive.write | container_blockio_io_service_bytes_recursive_write | Counter | bytes

Rootless Cgroups v2 Limitation

Network I/O stats are NOT available in rootless Podman with cgroups v2. The container.network.io.* metrics will not be populated. This is a kernel-level limitation — the network namespace in rootless containers does not expose traffic counters through cgroups.

Alternative: The dockerstatsreceiver can be used with the Podman socket (Podman provides Docker-compatible API), but it has the same network stats limitation under rootless cgroups v2.
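If you opt for that route, the receiver block is analogous (a sketch — the same rootless-UID socket path caveat as above applies):

```yaml
receivers:
  docker_stats:
    endpoint: unix:///run/user/1000/podman/podman.sock
    collection_interval: 15s
```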

Container-to-Agent Correlation

Correlate container metrics with agent telemetry using container labels:

# When launching agent containers, set identifying labels:
podman run --label agent.name=b4-dev --label agent.role=developer ...

These labels appear as metric dimensions in GreptimeDB, enabling JOINs between container metrics and agent traces.

Example Queries

SQL — Container CPU usage by container name:

SELECT
  container_name,
  avg(value) AS avg_cpu_percent
FROM container_cpu_percent
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY container_name
ORDER BY avg_cpu_percent DESC;

SQL — Container memory usage (top consumers):

SELECT
  container_name,
  max(value) / 1048576 AS max_memory_mb
FROM container_memory_usage
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY container_name
ORDER BY max_memory_mb DESC
LIMIT 10;

OTel Collector Pipeline Configuration

Complete Configuration Example

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  hostmetrics:
    collection_interval: 15s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
      network:
      load:
      filesystem:
        exclude_mount_points:
          mount_points:
            - /proc/*
            - /sys/*
            - /dev/*
          match_type: regexp
      process:
        include:
          names:
            - claude
            - codex
            - otelcol-contrib
            - podman
          match_type: regexp
        mute_process_all_errors: true

  podman_stats:
    endpoint: unix:///run/user/1000/podman/podman.sock
    collection_interval: 15s

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 256
    spike_limit_mib: 64

  batch:
    send_batch_size: 1024
    timeout: 5s

  resource:
    attributes:
      - key: host.name
        value: mimas
        action: upsert
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlphttp:
    endpoint: http://<greptimedb-host>:4000/v1/otlp
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp, debug]
    metrics:
      receivers: [otlp, hostmetrics, podman_stats]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

Replace <greptimedb-host> with the actual hostname or IP of the GreptimeDB machine.

Systemd Unit File

[Unit]
Description=OpenTelemetry Collector (contrib)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml
Restart=always
RestartSec=10
MemoryMax=512M
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
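A typical install-and-verify sequence for the unit above (a sketch; the paths match the examples in this section, and the final check relies on the collector's own telemetry endpoint on :8888 from the service configuration):

```shell
# Validate the config before (re)starting the service.
otelcol-contrib validate --config=/etc/otelcol-contrib/config.yaml

# Install and start the unit.
sudo systemctl daemon-reload
sudo systemctl enable --now otelcol-contrib

# The collector exposes its own Prometheus metrics on :8888.
curl -s http://localhost:8888/metrics | head
```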

OTel GenAI Semantic Conventions Reference

Status: The OTel GenAI Semantic Conventions are in Development status. Attribute names and semantics may change in future releases.

Span Types Relevant to b4arena

Span Name Pattern | Operation                               | Description
------------------|-----------------------------------------|---------------------------
chat              | gen_ai.operation.name = "chat"          | LLM conversation turn
execute_tool      | gen_ai.operation.name = "execute_tool"  | Tool/function execution
invoke_agent      | gen_ai.operation.name = "invoke_agent"  | Agent-to-agent delegation
create_agent      | gen_ai.operation.name = "create_agent"  | Agent instantiation

Key Attributes

Attribute                      | Type     | Description
-------------------------------|----------|----------------------------------------------------
gen_ai.usage.input_tokens      | int      | Input tokens consumed
gen_ai.usage.output_tokens     | int      | Output tokens produced
gen_ai.request.model           | string   | Model identifier (e.g., claude-sonnet-4-20250514)
gen_ai.conversation.id         | string   | Conversation/session identifier
gen_ai.agent.name              | string   | Agent name
gen_ai.tool.name               | string   | Tool/function name
gen_ai.system                  | string   | AI system identifier (e.g., anthropic, openai)
gen_ai.response.finish_reasons | string[] | Why the model stopped generating

Standard Metrics

Metric                           | Type      | Unit    | Description
---------------------------------|-----------|---------|---------------------------
gen_ai.client.token.usage        | Histogram | tokens  | Token usage distribution
gen_ai.client.operation.duration | Histogram | seconds | Operation duration

Content Capture

Content capture (prompts, completions) follows the opt-in model:

  • Captured as OTel log events linked to the parent span via trace_id and span_id
  • Not stored as span attributes (avoids span payload inflation)
  • Controlled via environment variable (OTEL_LOG_USER_PROMPTS=1 to enable content capture)
  • Stored in the opentelemetry_logs table, queryable via trace_id correlation
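Retrieving captured content for a given span therefore reduces to a trace_id/span_id lookup in the logs table. A sketch, with placeholder IDs:

SQL — Captured content for one span:

```sql
SELECT
  greptime_timestamp,
  severity_text,
  body,
  log_attributes
FROM opentelemetry_logs
WHERE trace_id = '<trace-id>'
  AND span_id = '<span-id>'
ORDER BY greptime_timestamp;
```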

GreptimeDB Auto-Created Tables

When GreptimeDB receives OTLP data, it auto-creates tables based on signal type and metric names.

Traces Table: opentelemetry_traces

The traces table has approximately 19 columns:

Column               | Type                | Description
---------------------|---------------------|---------------------------------------
greptime_timestamp   | TimestampNanosecond | Span start time (primary time index)
trace_id             | String              | W3C trace ID
span_id              | String              | Span ID
parent_span_id       | String              | Parent span ID (empty for root)
trace_state          | String              | W3C trace state
span_name            | String              | Operation name
span_kind            | String              | INTERNAL, SERVER, CLIENT, etc.
service_name         | String              | From resource attributes
resource_attributes  | JSON                | All resource attributes
scope_name           | String              | Instrumentation scope
scope_version        | String              | Scope version
span_attributes      | JSON                | All span attributes (flattened or JSON)
duration_nano        | Int64               | Span duration in nanoseconds
status_code          | String              | OK, ERROR, UNSET
status_message       | String              | Error message if status is ERROR
span_events          | JSON                | Span events array
span_links           | JSON                | Span links array
resource_schema_url  | String              | Resource schema URL
scope_schema_url     | String              | Scope schema URL

Logs Table: opentelemetry_logs

The logs table has approximately 17 columns:

Column                   | Type                | Description
-------------------------|---------------------|--------------------------------------
greptime_timestamp       | TimestampNanosecond | Log timestamp (primary time index)
trace_id                 | String              | Associated trace ID
span_id                  | String              | Associated span ID
trace_flags              | UInt32              | Trace flags
severity_text            | String              | INFO, WARN, ERROR, etc.
severity_number          | Int32               | Numeric severity
body                     | String              | Log message body
log_attributes           | JSON                | Log-level attributes
resource_attributes      | JSON                | Resource attributes
scope_name               | String              | Instrumentation scope
scope_version            | String              | Scope version
service_name             | String              | From resource attributes
resource_schema_url      | String              | Resource schema URL
scope_schema_url         | String              | Scope schema URL
observed_timestamp       | TimestampNanosecond | When the log was collected
flags                    | UInt32              | Log record flags
dropped_attributes_count | UInt32              | Dropped attributes count

Per-Metric Tables

Each OTLP metric creates a dedicated table. The naming convention replaces dots with underscores:

OTLP Metric Name          | GreptimeDB Table Name
--------------------------|---------------------------
system.cpu.utilization    | system_cpu_utilization
system.memory.usage       | system_memory_usage
claude_code.token.usage   | claude_code_token_usage
claude_code.cost.usage    | claude_code_cost_usage
container.cpu.usage.total | container_cpu_usage_total

Each metric table has columns for greptime_timestamp (time index), greptime_value or value (the metric value), and one column per label/attribute dimension.
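Because the rule is mechanical — dots become underscores — table names can be derived programmatically, e.g. when generating dashboards or queries. A minimal sketch (the helper name is ours, not a GreptimeDB API):

```python
def metric_table_name(otlp_metric_name: str) -> str:
    """Map an OTLP metric name to its auto-created GreptimeDB table name."""
    return otlp_metric_name.replace(".", "_")

print(metric_table_name("system.cpu.utilization"))     # system_cpu_utilization
print(metric_table_name("container.cpu.usage.total"))  # container_cpu_usage_total
```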


Cross-Signal Query Examples

These queries demonstrate the "data lake" value — correlating across metrics, traces, and logs in a single SQL database.

Token Costs Per Agent Per Day

SELECT
  service_name AS agent,
  date_trunc('day', greptime_timestamp) AS day,
  sum(value) AS total_cost_usd
FROM claude_code_cost_usage
WHERE greptime_timestamp > now() - '7d'::INTERVAL
GROUP BY agent, day
ORDER BY day DESC, total_cost_usd DESC;

CPU Usage During a Specific Trace

SELECT
  t.span_name,
  t.duration_nano / 1000000000.0 AS duration_sec,
  avg(c.value) AS avg_cpu_util
FROM opentelemetry_traces t
JOIN system_cpu_utilization c
  ON c.greptime_timestamp >= t.greptime_timestamp
  AND c.greptime_timestamp <= t.greptime_timestamp + t.duration_nano * '1 nanosecond'::INTERVAL
WHERE t.trace_id = '<trace-id>'
GROUP BY t.span_name, t.duration_nano
ORDER BY duration_sec DESC;

Tool Call Frequency and Duration by Agent

SELECT
  service_name AS agent,
  json_get_string(span_attributes, '$.gen_ai.tool.name') AS tool_name,
  count(*) AS call_count,
  avg(duration_nano) / 1000000 AS avg_duration_ms,
  max(duration_nano) / 1000000 AS max_duration_ms
FROM opentelemetry_traces
WHERE span_name IN ('execute_tool', 'claude_code_tool_use')
  AND greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY agent, tool_name
ORDER BY call_count DESC;

Agent Session Timeline with Correlated System Metrics

SELECT
  t.service_name AS agent,
  t.span_name,
  t.greptime_timestamp AS started_at,
  t.duration_nano / 1000000000.0 AS duration_sec,
  m.avg_memory_util
FROM opentelemetry_traces t
LEFT JOIN (
  SELECT
    date_bin('1 minute'::INTERVAL, greptime_timestamp, '1970-01-01T00:00:00'::TIMESTAMP) AS ts,
    avg(value) AS avg_memory_util
  FROM system_memory_utilization
  WHERE greptime_timestamp > now() - '24h'::INTERVAL
  GROUP BY ts
) m ON m.ts = date_bin('1 minute'::INTERVAL, t.greptime_timestamp, '1970-01-01T00:00:00'::TIMESTAMP)
WHERE t.parent_span_id = ''
  AND t.greptime_timestamp > now() - '24h'::INTERVAL
ORDER BY t.greptime_timestamp DESC;

Token Rate Per Agent (RANGE/ALIGN/FILL)

Uses GreptimeDB's native time-series aggregation for bucketed token rates:

SELECT
  service_name AS agent,
  sum(value) RANGE '5m' AS tokens_per_5m
FROM claude_code_token_usage
ALIGN '5m' BY (service_name) FILL 0;

Container Resource Usage vs. Token Throughput

SELECT
  c.container_name,
  avg(c.value) AS avg_cpu_percent,
  tok.total_tokens
FROM container_cpu_percent c
JOIN (
  SELECT
    service_name,
    sum(value) AS total_tokens
  FROM claude_code_token_usage
  WHERE greptime_timestamp > now() - '1h'::INTERVAL
  GROUP BY service_name
) tok ON tok.service_name LIKE '%' || c.container_name || '%'
WHERE c.greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY c.container_name, tok.total_tokens
ORDER BY tok.total_tokens DESC;