Observability Data Lake — Integration Reference
This document provides the complete data source inventory, configuration snippets, telemetry mapping, and query examples for the b4arena observability data lake. It serves as the technical reference for implementors deploying the architecture described in the ADR.
Architecture Overview
┌─── MIMAS ────────────────────────────────────────┐
│ │
│ Claude Code agents ──┐ │
│ Codex CLI agents ────┤── OTLP/gRPC ──┐ │
│ │ │ │
│ OTel Collector ◄─────┘ │ │
│ ├─ otlp receiver ◄───────────────────┘ │
│ ├─ hostmetricsreceiver │
│ ├─ podmanreceiver │
│ ├─ memory_limiter processor │
│ ├─ batch processor │
│ └─ otlphttp exporter ───────────────────────► │
└──────────────────────────────────────────────┬───┘
│
OTLP/HTTP │
▼
┌─── GREPTIMEDB HOST ────────┐
│ :4000 HTTP / PromQL / OTLP │
│ :4001 gRPC │
│ :4002 SQL (MySQL-compat) │
│ │
│ Grafana │
│ └─ GreptimeDB plugin │
│ (SQL + PromQL sources) │
└────────────────────────────┘
Key topology: OTel Collector runs on Mimas. GreptimeDB runs on a separate dedicated machine. All agent telemetry flows through the Collector, which exports to GreptimeDB via OTLP/HTTP.
Data Source: Claude Code Native OTel
Configuration
Set these environment variables in the agent's runtime environment:
CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=claude-code-agent-<name>
Cardinality Controls
# Exclude session_id as a metric dimension; set to true only if per-session
# series are needed (each session mints a new series, inflating cardinality)
OTEL_METRICS_INCLUDE_SESSION_ID=false
# Reduce cardinality by excluding high-cardinality attributes
OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT=256
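To see why the session_id control matters, here is a rough back-of-envelope series-cardinality estimate. All counts are illustrative assumptions, not measured values:

```python
# Rough series-cardinality estimate for claude_code_token_usage.
# All counts below are illustrative assumptions, not measured values.
models = 3          # e.g. distinct model identifiers in use
token_types = 4     # input, output, cache_read, cache_creation
agents = 10         # distinct service_name values

base_series = models * token_types * agents
print(base_series)  # 120 series without session_id

# With session_id as a dimension, every new session mints a new series:
sessions_per_agent_per_day = 50
with_sessions = base_series * sessions_per_agent_per_day
print(with_sessions)  # 6000 new series per day — unbounded growth over time
```

The base dimension set is bounded; session_id is not, which is why it is excluded by default.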
Emitted Metrics
| Metric Name | Type | Unit | Key Dimensions |
|---|---|---|---|
| claude_code_token_usage | Counter | tokens | model, type (input/output/cache_read/cache_creation), service_name |
| claude_code_cost_usage | Counter | USD | model, service_name |
| claude_code_lines_of_code | Gauge | lines | file_type, change_type (added/removed), service_name |
| claude_code_session_count | Counter | sessions | service_name |
| claude_code_tool_use_count | Counter | calls | tool_name, service_name |
Emitted Events (Log Records)
| Event Name | Key Attributes |
|---|---|
| gen_ai.user.message | gen_ai.conversation.id, content (opt-in) |
| gen_ai.assistant.message | gen_ai.conversation.id, content (opt-in) |
| gen_ai.tool.message | gen_ai.tool.name, gen_ai.conversation.id |
Trace Spans
Claude Code emits spans with the following structure:
- Root span: claude_code_session (session lifecycle)
- Child spans: claude_code_conversation_turn (per turn)
- Leaf spans: claude_code_tool_use (per tool invocation)
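This hierarchy can be reconstructed from the trace table's span_id/parent_span_id columns. A minimal sketch over hypothetical rows:

```python
# Rebuild a span tree from (span_id, parent_span_id, span_name) rows,
# as stored in opentelemetry_traces. The rows here are hypothetical.
rows = [
    {"span_id": "a1", "parent_span_id": "",   "span_name": "claude_code_session"},
    {"span_id": "b2", "parent_span_id": "a1", "span_name": "claude_code_conversation_turn"},
    {"span_id": "c3", "parent_span_id": "b2", "span_name": "claude_code_tool_use"},
]

# Index children by their parent's span_id (root spans have parent "").
children = {}
for r in rows:
    children.setdefault(r["parent_span_id"], []).append(r)

def render(parent_id="", depth=0):
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span["span_name"])
        lines.extend(render(span["span_id"], depth + 1))
    return lines

print("\n".join(render()))
# claude_code_session
#   claude_code_conversation_turn
#     claude_code_tool_use
```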
Resulting GreptimeDB Tables
Tables are auto-created from OTLP ingestion:
- claude_code_token_usage — token counter metric
- claude_code_cost_usage — cost counter metric
- claude_code_lines_of_code — gauge metric
- claude_code_session_count — session counter
- claude_code_tool_use_count — tool call counter
- opentelemetry_traces — all trace spans
- opentelemetry_logs — all log records (events)
Example Queries
SQL — Total tokens by model in last 24 hours:
SELECT
model,
type,
sum(value) AS total_tokens
FROM claude_code_token_usage
WHERE greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY model, type
ORDER BY total_tokens DESC;
PromQL — Token rate per agent (5-minute window):
sum by (service_name) (rate(claude_code_token_usage[5m]))
SQL — Cost per agent per day:
SELECT
service_name,
date_trunc('day', greptime_timestamp) AS day,
sum(value) AS total_cost_usd
FROM claude_code_cost_usage
WHERE greptime_timestamp > now() - '7d'::INTERVAL
GROUP BY service_name, day
ORDER BY day DESC, total_cost_usd DESC;
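The cost counter can be sanity-checked against the token counter. A sketch using placeholder per-million-token prices (these are NOT Anthropic's actual rates; real pricing varies by model and cache tier):

```python
# Cross-check claude_code_cost_usage against claude_code_token_usage.
# The prices below are placeholders for illustration, not real rates.
PRICE_PER_MTOK = {
    ("example-model", "input"): 3.0,
    ("example-model", "output"): 15.0,
}

def estimated_cost(token_rows):
    """token_rows: (model, type, tokens) tuples, as the token SQL returns."""
    total = 0.0
    for model, ttype, tokens in token_rows:
        rate = PRICE_PER_MTOK.get((model, ttype))
        if rate is not None:  # cache tiers etc. would need their own rates
            total += tokens / 1_000_000 * rate
    return total

rows = [("example-model", "input", 2_000_000), ("example-model", "output", 500_000)]
print(estimated_cost(rows))  # 13.5
```

If the estimate and the cost counter diverge wildly, the pricing table or the token dimensions are likely out of date.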
Data Source: Codex CLI Native OTel
Configuration
Add to ~/.codex/config.toml:
[otel]
enabled = true
endpoint = "http://localhost:4318"
protocol = "http/protobuf"
service_name = "codex-agent-<name>"
Emitted Telemetry
| Signal | Metric/Span Name | Key Attributes |
|---|---|---|
| Traces | codex_api_request | gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens |
| Traces | codex_tool_execution | gen_ai.tool.name, duration |
| Logs | gen_ai.user.message | gen_ai.conversation.id, content (opt-in) |
| Logs | gen_ai.assistant.message | gen_ai.conversation.id, content (opt-in) |
Known Limitations
- codex exec mode supports traces and logs but does not emit metrics.
- codex mcp-server mode has no OTel support — telemetry is not available in MCP server mode.
Resulting GreptimeDB Tables
- opentelemetry_traces — trace spans (shared table with Claude Code spans)
- opentelemetry_logs — log records (shared table)
Example Queries
SQL — Codex tool calls in last hour:
SELECT
span_name,
json_get_string(span_attributes, '$.gen_ai.tool.name') AS tool_name,
duration_nano / 1000000 AS duration_ms
FROM opentelemetry_traces
WHERE service_name = 'codex-agent-dev'
AND span_name = 'codex_tool_execution'
AND greptime_timestamp > now() - '1h'::INTERVAL
ORDER BY greptime_timestamp DESC;
Data Source: Host System Metrics (hostmetricsreceiver)
OTel Collector Configuration
receivers:
hostmetrics:
collection_interval: 15s
scrapers:
cpu:
metrics:
system.cpu.utilization:
enabled: true
memory:
metrics:
system.memory.utilization:
enabled: true
disk:
network:
load:
filesystem:
exclude_mount_points:
mount_points:
- /proc/*
- /sys/*
- /dev/*
match_type: regexp
process:
include:
names:
- claude
- codex
- otelcol-contrib
- podman
match_type: regexp
mute_process_all_errors: true
Resulting Metrics and GreptimeDB Tables
| Metric Name | GreptimeDB Table | Type | Unit |
|---|---|---|---|
| system.cpu.utilization | system_cpu_utilization | Gauge | ratio (0-1) |
| system.memory.usage | system_memory_usage | Gauge | bytes |
| system.memory.utilization | system_memory_utilization | Gauge | ratio (0-1) |
| system.disk.io | system_disk_io | Counter | bytes |
| system.disk.operations | system_disk_operations | Counter | operations |
| system.network.io | system_network_io | Counter | bytes |
| system.network.connections | system_network_connections | Gauge | connections |
| system.cpu.load_average.1m | system_cpu_load_average_1m | Gauge | load |
| system.filesystem.usage | system_filesystem_usage | Gauge | bytes |
| system.filesystem.utilization | system_filesystem_utilization | Gauge | ratio (0-1) |
| process.cpu.utilization | process_cpu_utilization | Gauge | ratio (0-1) |
| process.memory.physical_usage | process_memory_physical_usage | Gauge | bytes |
Example Queries
SQL — CPU utilization per core (5-minute RANGE aggregation, GreptimeDB syntax):
SELECT
cpu,
avg(value) RANGE '5m' AS avg_utilization
FROM system_cpu_utilization
ALIGN '5m' BY (cpu) FILL PREV;
The RANGE ... ALIGN ... FILL syntax is GreptimeDB-specific time-series aggregation. RANGE defines the window, ALIGN the bucket interval, BY the grouping key, and FILL the gap-filling strategy (PREV, LINEAR, NULL, or a constant).
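These semantics can be approximated in plain code: snap each sample onto the ALIGN grid, aggregate per bucket, then gap-fill. A simplified sketch for the case where the RANGE window equals the ALIGN step, with FILL PREV only (sample data is hypothetical):

```python
# Approximate `avg(value) RANGE '5m' ... ALIGN '5m' FILL PREV` for the
# simple case where the RANGE window equals the ALIGN step.
STEP = 300  # 5 minutes, in seconds

samples = [(10, 0.2), (70, 0.4), (310, 0.8)]  # (unix_ts, value), hypothetical

# ALIGN: snap each sample down to its bucket start on the grid.
buckets = {}
for ts, value in samples:
    bucket = ts // STEP * STEP
    buckets.setdefault(bucket, []).append(value)

# Aggregate per bucket; FILL PREV reuses the last value for empty buckets.
result, prev = [], None
for bucket in range(0, 900, STEP):
    if bucket in buckets:
        prev = sum(buckets[bucket]) / len(buckets[bucket])
    result.append((bucket, prev))

print(result)  # bucket 600 is empty and filled from bucket 300 (FILL PREV)
```

A full implementation would also handle RANGE wider than ALIGN (overlapping windows), which this sketch omits.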
SQL — Memory utilization over time (1-minute intervals, RANGE syntax):
SELECT
avg(value) RANGE '1m' AS avg_memory_util
FROM system_memory_utilization
ALIGN '1m' FILL LINEAR;
PromQL — System load average:
system_cpu_load_average_1m
SQL — Per-process CPU usage for agent processes:
SELECT
process_command,
avg(value) AS avg_cpu
FROM process_cpu_utilization
WHERE greptime_timestamp > now() - '30m'::INTERVAL
GROUP BY process_command
ORDER BY avg_cpu DESC;
Data Source: Podman Container Metrics (podmanreceiver)
OTel Collector Configuration
receivers:
podman_stats:
endpoint: unix:///run/user/1000/podman/podman.sock
collection_interval: 15s
The socket path uses the UID of the rootless Podman user. Replace 1000 with the actual UID (id -u).
Prerequisite: Enable the Podman socket for the rootless user:
systemctl --user enable --now podman.socket
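Rather than hard-coding the UID, the socket endpoint can be derived programmatically; a small sketch:

```python
import os

def podman_socket_endpoint(uid=None):
    """Build the rootless Podman socket endpoint for a given UID
    (defaults to the current user's UID, like `id -u`)."""
    if uid is None:
        uid = os.getuid()  # POSIX-only
    return f"unix:///run/user/{uid}/podman/podman.sock"

print(podman_socket_endpoint(1000))
# unix:///run/user/1000/podman/podman.sock
```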
Collected Metrics
| Metric Name | GreptimeDB Table | Type | Unit | Notes |
|---|---|---|---|---|
| container.cpu.usage.total | container_cpu_usage_total | Counter | nanoseconds | |
| container.cpu.percent | container_cpu_percent | Gauge | percent | |
| container.memory.usage | container_memory_usage | Gauge | bytes | |
| container.memory.percent | container_memory_percent | Gauge | percent | |
| container.blockio.io_service_bytes_recursive.read | container_blockio_io_service_bytes_recursive_read | Counter | bytes | |
| container.blockio.io_service_bytes_recursive.write | container_blockio_io_service_bytes_recursive_write | Counter | bytes | |
Rootless Cgroups v2 Limitation
Network I/O stats are NOT available in rootless Podman with cgroups v2. The container.network.io.* metrics will not be populated. This is a kernel-level limitation — the network namespace in rootless containers does not expose traffic counters through cgroups.
Alternative: The docker_stats receiver can be used with the Podman socket (Podman provides a Docker-compatible API), but it has the same network stats limitation under rootless cgroups v2.
Container-to-Agent Correlation
Correlate container metrics with agent telemetry using container labels:
# When launching agent containers, set identifying labels:
podman run --label agent.name=b4-dev --label agent.role=developer ...
These labels appear as metric dimensions in GreptimeDB, enabling JOINs between container metrics and agent traces.
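The correlation itself is a substring match between the container's agent.name label and the agent's service_name, the same shape as the SQL LIKE join shown later. A sketch over hypothetical rows:

```python
# Join container metric rows to agent trace rows via the agent.name label.
# All sample data below is hypothetical.
container_rows = [
    {"container_name": "b4-dev-ctr", "agent.name": "b4-dev", "cpu_percent": 42.0},
]
trace_rows = [
    {"service_name": "claude-code-agent-b4-dev", "span_name": "claude_code_tool_use"},
]

joined = [
    (c["container_name"], t["service_name"], c["cpu_percent"])
    for c in container_rows
    for t in trace_rows
    if c["agent.name"] in t["service_name"]  # substring match, like SQL LIKE
]
print(joined)
# [('b4-dev-ctr', 'claude-code-agent-b4-dev', 42.0)]
```

Substring matching is fragile if one agent name is a prefix of another; an exact-equality label convention avoids that.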
Example Queries
SQL — Container CPU usage by container name:
SELECT
container_name,
avg(value) AS avg_cpu_percent
FROM container_cpu_percent
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY container_name
ORDER BY avg_cpu_percent DESC;
SQL — Container memory usage (top consumers):
SELECT
container_name,
max(value) / 1048576 AS max_memory_mb
FROM container_memory_usage
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY container_name
ORDER BY max_memory_mb DESC
LIMIT 10;
OTel Collector Pipeline Configuration
Complete Configuration Example
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
hostmetrics:
collection_interval: 15s
scrapers:
cpu:
metrics:
system.cpu.utilization:
enabled: true
memory:
metrics:
system.memory.utilization:
enabled: true
disk:
network:
load:
filesystem:
exclude_mount_points:
mount_points:
- /proc/*
- /sys/*
- /dev/*
match_type: regexp
process:
include:
names:
- claude
- codex
- otelcol-contrib
- podman
match_type: regexp
mute_process_all_errors: true
podman_stats:
endpoint: unix:///run/user/1000/podman/podman.sock
collection_interval: 15s
processors:
memory_limiter:
check_interval: 5s
limit_mib: 256
spike_limit_mib: 64
batch:
send_batch_size: 1024
timeout: 5s
resource:
attributes:
- key: host.name
value: mimas
action: upsert
- key: deployment.environment
value: production
action: upsert
exporters:
otlphttp:
endpoint: http://<greptimedb-host>:4000/v1/otlp
tls:
insecure: true
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
debug:
verbosity: basic
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [otlphttp, debug]
metrics:
receivers: [otlp, hostmetrics, podman_stats]
processors: [memory_limiter, batch, resource]
exporters: [otlphttp]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [otlphttp]
telemetry:
logs:
level: info
metrics:
address: 0.0.0.0:8888
Replace <greptimedb-host> with the actual hostname or IP of the GreptimeDB machine.
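A common failure mode when editing this file is a pipeline referencing a component that was never defined. A sketch of a pre-flight check over the parsed config (structure assumed to match the YAML above; a real check would load it with a YAML parser):

```python
# Verify every component referenced in service.pipelines is defined in the
# corresponding top-level section of the Collector config.
config = {
    "receivers": {"otlp": {}, "hostmetrics": {}, "podman_stats": {}},
    "processors": {"memory_limiter": {}, "batch": {}, "resource": {}},
    "exporters": {"otlphttp": {}, "debug": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp", "hostmetrics", "podman_stats"],
                "processors": ["memory_limiter", "batch", "resource"],
                "exporters": ["otlphttp"],
            },
        },
    },
}

def undefined_components(cfg):
    """Return (pipeline, section, component) triples that do not resolve."""
    missing = []
    for name, pipe in cfg["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for comp in pipe.get(section, []):
                if comp not in cfg.get(section, {}):
                    missing.append((name, section, comp))
    return missing

print(undefined_components(config))  # [] — all references resolve
```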
Systemd Unit File
[Unit]
Description=OpenTelemetry Collector (contrib)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml
Restart=always
RestartSec=10
MemoryMax=512M
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
OTel GenAI Semantic Conventions Reference
Status: The OTel GenAI Semantic Conventions are in Development status. Attribute names and semantics may change in future releases.
Span Types Relevant to b4arena
| Span Name Pattern | Operation | Description |
|---|---|---|
| chat | gen_ai.operation.name = "chat" | LLM conversation turn |
| execute_tool | gen_ai.operation.name = "execute_tool" | Tool/function execution |
| invoke_agent | gen_ai.operation.name = "invoke_agent" | Agent-to-agent delegation |
| create_agent | gen_ai.operation.name = "create_agent" | Agent instantiation |
Key Attributes
| Attribute | Type | Description |
|---|---|---|
| gen_ai.usage.input_tokens | int | Input tokens consumed |
| gen_ai.usage.output_tokens | int | Output tokens produced |
| gen_ai.request.model | string | Model identifier (e.g., claude-sonnet-4-20250514) |
| gen_ai.conversation.id | string | Conversation/session identifier |
| gen_ai.agent.name | string | Agent name |
| gen_ai.tool.name | string | Tool/function name |
| gen_ai.system | string | AI system identifier (e.g., anthropic, openai) |
| gen_ai.response.finish_reasons | string[] | Why the model stopped generating |
Standard Metrics
| Metric | Type | Unit | Description |
|---|---|---|---|
| gen_ai.client.token.usage | Histogram | tokens | Token usage distribution |
| gen_ai.client.operation.duration | Histogram | seconds | Operation duration |
Content Capture
Content capture (prompts, completions) follows the opt-in model:
- Captured as OTel log events linked to the parent span via trace_id and span_id
- Not stored as span attributes (avoids span payload inflation)
- Controlled via environment variable (OTEL_LOG_USER_PROMPTS=1 to enable content capture)
- Stored in the opentelemetry_logs table, queryable via trace_id correlation
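Because content events carry the parent span's trace_id and span_id, prompt text can be pulled alongside its span. A sketch over hypothetical rows from the two tables:

```python
# Attach opentelemetry_logs rows (content events) to their parent span
# via (trace_id, span_id). All sample rows are hypothetical.
spans = [{"trace_id": "t1", "span_id": "s1", "span_name": "chat"}]
logs = [
    {"trace_id": "t1", "span_id": "s1", "body": "user prompt text"},
    {"trace_id": "t9", "span_id": "s9", "body": "unrelated record"},
]

# Index spans by their (trace_id, span_id) pair, then look up each log.
by_span = {(s["trace_id"], s["span_id"]): s for s in spans}
correlated = [
    (by_span[(log["trace_id"], log["span_id"])]["span_name"], log["body"])
    for log in logs
    if (log["trace_id"], log["span_id"]) in by_span
]
print(correlated)  # [('chat', 'user prompt text')]
```

In SQL this is the equivalent of joining opentelemetry_logs to opentelemetry_traces on trace_id and span_id.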
GreptimeDB Auto-Created Tables
When GreptimeDB receives OTLP data, it auto-creates tables based on signal type and metric names.
Traces Table: opentelemetry_traces
The traces table has approximately 19 columns:
| Column | Type | Description |
|---|---|---|
| greptime_timestamp | TimestampNanosecond | Span start time (primary time index) |
| trace_id | String | W3C trace ID |
| span_id | String | Span ID |
| parent_span_id | String | Parent span ID (empty for root) |
| trace_state | String | W3C trace state |
| span_name | String | Operation name |
| span_kind | String | INTERNAL, SERVER, CLIENT, etc. |
| service_name | String | From resource attributes |
| resource_attributes | JSON | All resource attributes |
| scope_name | String | Instrumentation scope |
| scope_version | String | Scope version |
| span_attributes | JSON | All span attributes (flattened or JSON) |
| duration_nano | Int64 | Span duration in nanoseconds |
| status_code | String | OK, ERROR, UNSET |
| status_message | String | Error message if status is ERROR |
| span_events | JSON | Span events array |
| span_links | JSON | Span links array |
| resource_schema_url | String | Resource schema URL |
| scope_schema_url | String | Scope schema URL |
Logs Table: opentelemetry_logs
The logs table has approximately 17 columns:
| Column | Type | Description |
|---|---|---|
| greptime_timestamp | TimestampNanosecond | Log timestamp (primary time index) |
| trace_id | String | Associated trace ID |
| span_id | String | Associated span ID |
| trace_flags | UInt32 | Trace flags |
| severity_text | String | INFO, WARN, ERROR, etc. |
| severity_number | Int32 | Numeric severity |
| body | String | Log message body |
| log_attributes | JSON | Log-level attributes |
| resource_attributes | JSON | Resource attributes |
| scope_name | String | Instrumentation scope |
| scope_version | String | Scope version |
| service_name | String | From resource attributes |
| resource_schema_url | String | Resource schema URL |
| scope_schema_url | String | Scope schema URL |
| observed_timestamp | TimestampNanosecond | When log was collected |
| flags | UInt32 | Log record flags |
| dropped_attributes_count | UInt32 | Dropped attributes count |
Per-Metric Tables
Each OTLP metric creates a dedicated table. The naming convention replaces dots with underscores:
| OTLP Metric Name | GreptimeDB Table Name |
|---|---|
| system.cpu.utilization | system_cpu_utilization |
| system.memory.usage | system_memory_usage |
| claude_code.token.usage | claude_code_token_usage |
| claude_code.cost.usage | claude_code_cost_usage |
| container.cpu.usage.total | container_cpu_usage_total |
Each metric table has columns for greptime_timestamp (time index), greptime_value or value (the metric value), and one column per label/attribute dimension.
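The mapping is mechanical (dots become underscores), so dashboards and tooling can derive table names instead of hard-coding them. A sketch:

```python
def metric_to_table(metric_name: str) -> str:
    """Map an OTLP metric name to its auto-created GreptimeDB table name:
    dots are replaced with underscores."""
    return metric_name.replace(".", "_")

print(metric_to_table("system.cpu.utilization"))    # system_cpu_utilization
print(metric_to_table("container.cpu.usage.total")) # container_cpu_usage_total
```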
Cross-Signal Query Examples
These queries demonstrate the "data lake" value — correlating across metrics, traces, and logs in a single SQL database.
Token Costs Per Agent Per Day
SELECT
service_name AS agent,
date_trunc('day', greptime_timestamp) AS day,
sum(value) AS total_cost_usd
FROM claude_code_cost_usage
WHERE greptime_timestamp > now() - '7d'::INTERVAL
GROUP BY agent, day
ORDER BY day DESC, total_cost_usd DESC;
CPU Usage During a Specific Trace
SELECT
t.span_name,
t.duration_nano / 1000000000.0 AS duration_sec,
avg(c.value) AS avg_cpu_util
FROM opentelemetry_traces t
JOIN system_cpu_utilization c
ON c.greptime_timestamp >= t.greptime_timestamp
AND c.greptime_timestamp <= t.greptime_timestamp + t.duration_nano * '1 nanosecond'::INTERVAL
WHERE t.trace_id = '<trace-id>'
GROUP BY t.span_name, t.duration_nano
ORDER BY duration_sec DESC;
Tool Call Frequency and Duration by Agent
SELECT
service_name AS agent,
json_get_string(span_attributes, '$.gen_ai.tool.name') AS tool_name,
count(*) AS call_count,
avg(duration_nano) / 1000000 AS avg_duration_ms,
max(duration_nano) / 1000000 AS max_duration_ms
FROM opentelemetry_traces
WHERE span_name IN ('execute_tool', 'claude_code_tool_use')
AND greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY agent, tool_name
ORDER BY call_count DESC;
Agent Session Timeline with Correlated System Metrics
SELECT
t.service_name AS agent,
t.span_name,
t.greptime_timestamp AS started_at,
t.duration_nano / 1000000000.0 AS duration_sec,
m.avg_memory_util
FROM opentelemetry_traces t
LEFT JOIN (
SELECT
date_bin('1 minute'::INTERVAL, greptime_timestamp, '1970-01-01T00:00:00'::TIMESTAMP) AS ts,
avg(value) AS avg_memory_util
FROM system_memory_utilization
WHERE greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY ts
) m ON m.ts = date_bin('1 minute'::INTERVAL, t.greptime_timestamp, '1970-01-01T00:00:00'::TIMESTAMP)
WHERE t.parent_span_id = ''
AND t.greptime_timestamp > now() - '24h'::INTERVAL
ORDER BY t.greptime_timestamp DESC;
Token Rate Per Agent (RANGE/ALIGN/FILL)
Uses GreptimeDB's native time-series aggregation for bucketed token rates:
SELECT
service_name AS agent,
sum(value) RANGE '5m' AS tokens_per_5m
FROM claude_code_token_usage
ALIGN '5m' BY (service_name) FILL 0;
Container Resource Usage vs. Token Throughput
SELECT
c.container_name,
avg(c.value) AS avg_cpu_percent,
tok.total_tokens
FROM container_cpu_percent c
JOIN (
SELECT
service_name,
sum(value) AS total_tokens
FROM claude_code_token_usage
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY service_name
) tok ON tok.service_name LIKE '%' || c.container_name || '%'
WHERE c.greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY c.container_name, tok.total_tokens
ORDER BY tok.total_tokens DESC;