Observability Data Lake — Integration Reference

This document provides the complete data source inventory, configuration snippets, telemetry mapping, and query examples for the b4arena observability data lake. It serves as the technical reference for implementors deploying the architecture described in the ADR.


Architecture Overview

┌─── MIMAS ──────────────────────────────────────────┐
│                                                    │
│  Claude Code agents ──┐                            │
│  Codex CLI agents ────┤── OTLP/gRPC ──┐            │
│                       │               │            │
│  OTel Collector ◄─────┘               │            │
│    ├─ otlp receiver ◄─────────────────┘            │
│    ├─ hostmetricsreceiver                          │
│    ├─ podmanreceiver                               │
│    ├─ memory_limiter processor                     │
│    ├─ batch processor                              │
│    └─ otlphttp exporter ─────────────────────►     │
└────────────────────────────────────────────┬───────┘
                                             │
                                  OTLP/HTTP  │
                                             ▼
                    ┌─── GREPTIMEDB HOST ────────────┐
                    │  :4000 HTTP / PromQL / OTLP    │
                    │  :4001 gRPC                    │
                    │  :4002 SQL (MySQL-compat)      │
                    │                                │
                    │  Grafana                       │
                    │    └─ GreptimeDB plugin        │
                    │       (SQL + PromQL sources)   │
                    └────────────────────────────────┘

Key topology: OTel Collector runs on Mimas. GreptimeDB runs on a separate dedicated machine. All agent telemetry flows through the Collector, which exports to GreptimeDB via OTLP/HTTP.


Data Source: Claude Code Native OTel

Configuration

Set these environment variables in the agent's runtime environment:

CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=claude-code-agent-<name>
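One convenient pattern (a sketch — the file path and agent name are illustrative, not part of the reference configuration) is to keep these settings in a single env file and source it before launching the agent:

```shell
# Illustrative path; pick a location that suits your deployment.
cat > /tmp/claude-otel.env <<'EOF'
CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_METRICS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=claude-code-agent-dev
EOF

# Export every variable in the file into the current shell,
# then launch the agent from this shell.
set -a
. /tmp/claude-otel.env
set +a
echo "$OTEL_SERVICE_NAME"
```

The same file can be reused as a systemd `EnvironmentFile=` if the agent runs as a service.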

Cardinality Controls

# session_id as a metric dimension is left disabled here, since
# enabling it significantly increases metric cardinality
OTEL_METRICS_INCLUDE_SESSION_ID=false

# Truncate long attribute values to bound cardinality and payload size
OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT=256

Emitted Metrics

Metric Name                | Type    | Unit     | Key Dimensions
---------------------------|---------|----------|-----------------------------------------------------------
claude_code_token_usage    | Counter | tokens   | model, type (input/output/cache_read/cache_creation), service_name
claude_code_cost_usage     | Counter | USD      | model, service_name
claude_code_lines_of_code  | Gauge   | lines    | file_type, change_type (added/removed), service_name
claude_code_session_count  | Counter | sessions | service_name
claude_code_tool_use_count | Counter | calls    | tool_name, service_name

Emitted Events (Log Records)

Event Name               | Key Attributes
-------------------------|------------------------------------------
gen_ai.user.message      | gen_ai.conversation.id, content (opt-in)
gen_ai.assistant.message | gen_ai.conversation.id, content (opt-in)
gen_ai.tool.message      | gen_ai.tool.name, gen_ai.conversation.id

Trace Spans

Claude Code emits spans with the following structure:

  • Root span: claude_code_session (session lifecycle)
  • Child spans: claude_code_conversation_turn (per turn)
  • Leaf spans: claude_code_tool_use (per tool invocation)

Resulting GreptimeDB Tables

Tables are auto-created from OTLP ingestion:

  • claude_code_token_usage — token counter metric
  • claude_code_cost_usage — cost counter metric
  • claude_code_lines_of_code — gauge metric
  • claude_code_session_count — session counter
  • claude_code_tool_use_count — tool call counter
  • opentelemetry_traces — all trace spans
  • opentelemetry_logs — all log records (events)

Example Queries

SQL — Total tokens by model in last 24 hours:

SELECT
  model,
  type,
  sum(value) AS total_tokens
FROM claude_code_token_usage
WHERE greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY model, type
ORDER BY total_tokens DESC;

PromQL — Token rate per agent (5-minute window):

sum by (service_name) (rate(claude_code_token_usage[5m]))

SQL — Cost per agent per day:

SELECT
  service_name,
  date_trunc('day', greptime_timestamp) AS day,
  sum(value) AS total_cost_usd
FROM claude_code_cost_usage
WHERE greptime_timestamp > now() - '7d'::INTERVAL
GROUP BY service_name, day
ORDER BY day DESC, total_cost_usd DESC;
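The session/turn/tool span hierarchy described above can also be reconstructed from the traces table with a self-join on parent_span_id. A sketch, using the opentelemetry_traces columns documented later in this reference; '<trace-id>' is a placeholder:

SQL — Span tree for one session:

```sql
SELECT
  child.span_name,
  parent.span_name AS parent_span,
  child.duration_nano / 1000000 AS duration_ms
FROM opentelemetry_traces child
LEFT JOIN opentelemetry_traces parent
  ON parent.trace_id = child.trace_id
  AND parent.span_id = child.parent_span_id
WHERE child.trace_id = '<trace-id>'
ORDER BY child.greptime_timestamp;
```

Root spans (claude_code_session) come back with a NULL parent_span, since their parent_span_id is empty.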

Data Source: Codex CLI Native OTel

Configuration

Add to ~/.codex/config.toml:

[otel]
enabled = true
endpoint = "http://localhost:4318"
protocol = "http/protobuf"
service_name = "codex-agent-<name>"

Emitted Telemetry

Signal | Metric/Span Name         | Key Attributes
-------|--------------------------|-----------------------------------------------------------------------------
Traces | codex_api_request        | gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
Traces | codex_tool_execution     | gen_ai.tool.name, duration
Logs   | gen_ai.user.message      | gen_ai.conversation.id, content (opt-in)
Logs   | gen_ai.assistant.message | gen_ai.conversation.id, content (opt-in)

Known Limitations

  • codex exec mode supports traces and logs but does not emit metrics.
  • codex mcp-server mode has no OTel support — telemetry is not available in MCP server mode.

Resulting GreptimeDB Tables

  • opentelemetry_traces — trace spans (shared table with Claude Code spans)
  • opentelemetry_logs — log records (shared table)

Example Queries

SQL — Codex tool calls in last hour:

SELECT
  span_name,
  json_get_string(span_attributes, '$.gen_ai.tool.name') AS tool_name,
  duration_nano / 1000000 AS duration_ms
FROM opentelemetry_traces
WHERE service_name = 'codex-agent-dev'
  AND span_name = 'codex_tool_execution'
  AND greptime_timestamp > now() - '1h'::INTERVAL
ORDER BY greptime_timestamp DESC;
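Because codex exec emits no metrics, token counts have to be derived from the gen_ai.usage.* span attributes instead. A sketch — it assumes the attributes land in the span_attributes JSON column as described later, and uses GreptimeDB's json_get_int extraction function:

SQL — Codex token totals per agent from spans (last 24 hours):

```sql
SELECT
  service_name,
  sum(json_get_int(span_attributes, '$.gen_ai.usage.input_tokens')) AS input_tokens,
  sum(json_get_int(span_attributes, '$.gen_ai.usage.output_tokens')) AS output_tokens
FROM opentelemetry_traces
WHERE span_name = 'codex_api_request'
  AND greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY service_name;
```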

Data Source: Host System Metrics (hostmetricsreceiver)

OTel Collector Configuration

receivers:
  hostmetrics:
    collection_interval: 15s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
      network:
      load:
      filesystem:
        exclude_mount_points:
          mount_points:
            - /proc/*
            - /sys/*
            - /dev/*
          match_type: regexp
      process:
        include:
          names:
            - claude
            - codex
            - otelcol-contrib
            - podman
          match_type: regexp
        mute_process_all_errors: true

Resulting Metrics and GreptimeDB Tables

Metric Name                   | GreptimeDB Table               | Type    | Unit
------------------------------|--------------------------------|---------|------------
system.cpu.utilization        | system_cpu_utilization         | Gauge   | ratio (0-1)
system.memory.usage           | system_memory_usage            | Gauge   | bytes
system.memory.utilization     | system_memory_utilization      | Gauge   | ratio (0-1)
system.disk.io                | system_disk_io                 | Counter | bytes
system.disk.operations        | system_disk_operations         | Counter | operations
system.network.io             | system_network_io              | Counter | bytes
system.network.connections    | system_network_connections     | Gauge   | connections
system.cpu.load_average.1m    | system_cpu_load_average_1m     | Gauge   | load
system.filesystem.usage       | system_filesystem_usage        | Gauge   | bytes
system.filesystem.utilization | system_filesystem_utilization  | Gauge   | ratio (0-1)
process.cpu.utilization       | process_cpu_utilization        | Gauge   | ratio (0-1)
process.memory.physical_usage | process_memory_physical_usage  | Gauge   | bytes

Example Queries

SQL — CPU utilization per core (5-minute RANGE aggregation, GreptimeDB syntax):

SELECT
  cpu,
  avg(value) RANGE '5m' AS avg_utilization
FROM system_cpu_utilization
ALIGN '5m' BY (cpu) FILL PREV;

The RANGE ... ALIGN ... FILL syntax is GreptimeDB-specific time-series aggregation. RANGE defines the window, ALIGN the bucket interval, BY the grouping key, and FILL the gap-filling strategy (PREV, LINEAR, NULL, or a constant).

SQL — Memory utilization over time (1-minute intervals, RANGE syntax):

SELECT
  avg(value) RANGE '1m' AS avg_memory_util
FROM system_memory_utilization
ALIGN '1m' FILL LINEAR;

PromQL — System load average:

system_cpu_load_average_1m

SQL — Per-process CPU usage for agent processes:

SELECT
  process_command,
  avg(value) AS avg_cpu
FROM process_cpu_utilization
WHERE greptime_timestamp > now() - '30m'::INTERVAL
GROUP BY process_command
ORDER BY avg_cpu DESC;

Data Source: Podman Container Metrics (podmanreceiver)

OTel Collector Configuration

receivers:
  podman_stats:
    endpoint: unix:///run/user/1000/podman/podman.sock
    collection_interval: 15s

The socket path uses the UID of the rootless Podman user. Replace 1000 with the actual UID (id -u).

Prerequisite: Enable the Podman socket for the rootless user:

systemctl --user enable --now podman.socket
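Before pointing the receiver at the socket, it is worth confirming the API actually answers. A quick check (a sketch; it uses the libpod ping endpoint and assumes curl is available, with $XDG_RUNTIME_DIR standing in for /run/user/<uid>):

```shell
# Prints "OK" if the rootless Podman API socket is up.
curl --unix-socket "$XDG_RUNTIME_DIR/podman/podman.sock" http://d/_ping
```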

Collected Metrics

Metric Name                                        | GreptimeDB Table                                   | Type    | Unit
---------------------------------------------------|----------------------------------------------------|---------|------------
container.cpu.usage.total                          | container_cpu_usage_total                          | Counter | nanoseconds
container.cpu.percent                              | container_cpu_percent                              | Gauge   | percent
container.memory.usage                             | container_memory_usage                             | Gauge   | bytes
container.memory.percent                           | container_memory_percent                           | Gauge   | percent
container.blockio.io_service_bytes_recursive.read  | container_blockio_io_service_bytes_recursive_read  | Counter | bytes
container.blockio.io_service_bytes_recursive.write | container_blockio_io_service_bytes_recursive_write | Counter | bytes

Rootless Cgroups v2 Limitation

Network I/O stats are NOT available in rootless Podman with cgroups v2. The container.network.io.* metrics will not be populated. This is a kernel-level limitation — the network namespace in rootless containers does not expose traffic counters through cgroups.

Alternative: The dockerstatsreceiver can be used with the Podman socket (Podman provides Docker-compatible API), but it has the same network stats limitation under rootless cgroups v2.
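If you opt for that route, the receiver block is analogous (a sketch — the same rootless-UID socket path caveat as above applies):

```yaml
receivers:
  docker_stats:
    endpoint: unix:///run/user/1000/podman/podman.sock
    collection_interval: 15s
```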

Container-to-Agent Correlation

Correlate container metrics with agent telemetry using container labels:

# When launching agent containers, set identifying labels:
podman run --label agent.name=b4-dev --label agent.role=developer ...

These labels appear as metric dimensions in GreptimeDB, enabling JOINs between container metrics and agent traces.

Example Queries

SQL — Container CPU usage by container name:

SELECT
  container_name,
  avg(value) AS avg_cpu_percent
FROM container_cpu_percent
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY container_name
ORDER BY avg_cpu_percent DESC;

SQL — Container memory usage (top consumers):

SELECT
  container_name,
  max(value) / 1048576 AS max_memory_mb
FROM container_memory_usage
WHERE greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY container_name
ORDER BY max_memory_mb DESC
LIMIT 10;

OTel Collector Pipeline Configuration

Complete Configuration Example

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  hostmetrics:
    collection_interval: 15s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      disk:
      network:
      load:
      filesystem:
        exclude_mount_points:
          mount_points:
            - /proc/*
            - /sys/*
            - /dev/*
          match_type: regexp
      process:
        include:
          names:
            - claude
            - codex
            - otelcol-contrib
            - podman
          match_type: regexp
        mute_process_all_errors: true

  podman_stats:
    endpoint: unix:///run/user/1000/podman/podman.sock
    collection_interval: 15s

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 256
    spike_limit_mib: 64

  batch:
    send_batch_size: 1024
    timeout: 5s

  resource:
    attributes:
      - key: host.name
        value: mimas
        action: upsert
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlphttp:
    endpoint: http://<greptimedb-host>:4000/v1/otlp
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp, debug]
    metrics:
      receivers: [otlp, hostmetrics, podman_stats]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

Replace <greptimedb-host> with the actual hostname or IP of the GreptimeDB machine.

Systemd Unit File

[Unit]
Description=OpenTelemetry Collector (contrib)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml
Restart=always
RestartSec=10
MemoryMax=512M
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
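A typical install-and-verify sequence for the unit above (a sketch; the paths match the examples in this section, and the final check relies on the collector's own telemetry endpoint on :8888 from the service configuration):

```shell
# Validate the config before (re)starting the service.
otelcol-contrib validate --config=/etc/otelcol-contrib/config.yaml

# Install and start the unit.
sudo systemctl daemon-reload
sudo systemctl enable --now otelcol-contrib

# The collector exposes its own Prometheus metrics on :8888.
curl -s http://localhost:8888/metrics | head
```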

OTel GenAI Semantic Conventions Reference

Status: The OTel GenAI Semantic Conventions are in Development status. Attribute names and semantics may change in future releases.

Span Types Relevant to b4arena

Span Name Pattern | Operation                               | Description
------------------|-----------------------------------------|---------------------------
chat              | gen_ai.operation.name = "chat"          | LLM conversation turn
execute_tool      | gen_ai.operation.name = "execute_tool"  | Tool/function execution
invoke_agent      | gen_ai.operation.name = "invoke_agent"  | Agent-to-agent delegation
create_agent      | gen_ai.operation.name = "create_agent"  | Agent instantiation

Key Attributes

Attribute                      | Type     | Description
-------------------------------|----------|----------------------------------------------------
gen_ai.usage.input_tokens      | int      | Input tokens consumed
gen_ai.usage.output_tokens     | int      | Output tokens produced
gen_ai.request.model           | string   | Model identifier (e.g., claude-sonnet-4-20250514)
gen_ai.conversation.id         | string   | Conversation/session identifier
gen_ai.agent.name              | string   | Agent name
gen_ai.tool.name               | string   | Tool/function name
gen_ai.system                  | string   | AI system identifier (e.g., anthropic, openai)
gen_ai.response.finish_reasons | string[] | Why the model stopped generating

Standard Metrics

Metric                           | Type      | Unit    | Description
---------------------------------|-----------|---------|---------------------------
gen_ai.client.token.usage        | Histogram | tokens  | Token usage distribution
gen_ai.client.operation.duration | Histogram | seconds | Operation duration

Content Capture

Content capture (prompts, completions) follows the opt-in model:

  • Captured as OTel log events linked to the parent span via trace_id and span_id
  • Not stored as span attributes (avoids span payload inflation)
  • Controlled via environment variable (OTEL_LOG_USER_PROMPTS=1 to enable content capture)
  • Stored in the opentelemetry_logs table, queryable via trace_id correlation
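Retrieving captured content for a given span therefore reduces to a trace_id/span_id lookup in the logs table. A sketch, with placeholder IDs:

SQL — Captured content for one span:

```sql
SELECT
  greptime_timestamp,
  severity_text,
  body,
  log_attributes
FROM opentelemetry_logs
WHERE trace_id = '<trace-id>'
  AND span_id = '<span-id>'
ORDER BY greptime_timestamp;
```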

GreptimeDB Auto-Created Tables

When GreptimeDB receives OTLP data, it auto-creates tables based on signal type and metric names.

Traces Table: opentelemetry_traces

The traces table has approximately 19 columns:

Column               | Type                | Description
---------------------|---------------------|---------------------------------------
greptime_timestamp   | TimestampNanosecond | Span start time (primary time index)
trace_id             | String              | W3C trace ID
span_id              | String              | Span ID
parent_span_id       | String              | Parent span ID (empty for root)
trace_state          | String              | W3C trace state
span_name            | String              | Operation name
span_kind            | String              | INTERNAL, SERVER, CLIENT, etc.
service_name         | String              | From resource attributes
resource_attributes  | JSON                | All resource attributes
scope_name           | String              | Instrumentation scope
scope_version        | String              | Scope version
span_attributes      | JSON                | All span attributes (flattened or JSON)
duration_nano        | Int64               | Span duration in nanoseconds
status_code          | String              | OK, ERROR, UNSET
status_message       | String              | Error message if status is ERROR
span_events          | JSON                | Span events array
span_links           | JSON                | Span links array
resource_schema_url  | String              | Resource schema URL
scope_schema_url     | String              | Scope schema URL

Logs Table: opentelemetry_logs

The logs table has approximately 17 columns:

Column                   | Type                | Description
-------------------------|---------------------|--------------------------------------
greptime_timestamp       | TimestampNanosecond | Log timestamp (primary time index)
trace_id                 | String              | Associated trace ID
span_id                  | String              | Associated span ID
trace_flags              | UInt32              | Trace flags
severity_text            | String              | INFO, WARN, ERROR, etc.
severity_number          | Int32               | Numeric severity
body                     | String              | Log message body
log_attributes           | JSON                | Log-level attributes
resource_attributes      | JSON                | Resource attributes
scope_name               | String              | Instrumentation scope
scope_version            | String              | Scope version
service_name             | String              | From resource attributes
resource_schema_url      | String              | Resource schema URL
scope_schema_url         | String              | Scope schema URL
observed_timestamp       | TimestampNanosecond | When the log was collected
flags                    | UInt32              | Log record flags
dropped_attributes_count | UInt32              | Dropped attributes count

Per-Metric Tables

Each OTLP metric creates a dedicated table. The naming convention replaces dots with underscores:

OTLP Metric Name          | GreptimeDB Table Name
--------------------------|---------------------------
system.cpu.utilization    | system_cpu_utilization
system.memory.usage       | system_memory_usage
claude_code.token.usage   | claude_code_token_usage
claude_code.cost.usage    | claude_code_cost_usage
container.cpu.usage.total | container_cpu_usage_total

Each metric table has columns for greptime_timestamp (time index), greptime_value or value (the metric value), and one column per label/attribute dimension.
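Because the rule is mechanical — dots become underscores — table names can be derived programmatically, e.g. when generating dashboards or queries. A minimal sketch (the helper name is ours, not a GreptimeDB API):

```python
def metric_table_name(otlp_metric_name: str) -> str:
    """Map an OTLP metric name to its auto-created GreptimeDB table name."""
    return otlp_metric_name.replace(".", "_")

print(metric_table_name("system.cpu.utilization"))     # system_cpu_utilization
print(metric_table_name("container.cpu.usage.total"))  # container_cpu_usage_total
```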


Cross-Signal Query Examples

These queries demonstrate the "data lake" value — correlating across metrics, traces, and logs in a single SQL database.

Token Costs Per Agent Per Day

SELECT
  service_name AS agent,
  date_trunc('day', greptime_timestamp) AS day,
  sum(value) AS total_cost_usd
FROM claude_code_cost_usage
WHERE greptime_timestamp > now() - '7d'::INTERVAL
GROUP BY agent, day
ORDER BY day DESC, total_cost_usd DESC;

CPU Usage During a Specific Trace

SELECT
  t.span_name,
  t.duration_nano / 1000000000.0 AS duration_sec,
  avg(c.value) AS avg_cpu_util
FROM opentelemetry_traces t
JOIN system_cpu_utilization c
  ON c.greptime_timestamp >= t.greptime_timestamp
  AND c.greptime_timestamp <= t.greptime_timestamp + t.duration_nano * '1 nanosecond'::INTERVAL
WHERE t.trace_id = '<trace-id>'
GROUP BY t.span_name, t.duration_nano
ORDER BY duration_sec DESC;

Tool Call Frequency and Duration by Agent

SELECT
  service_name AS agent,
  json_get_string(span_attributes, '$.gen_ai.tool.name') AS tool_name,
  count(*) AS call_count,
  avg(duration_nano) / 1000000 AS avg_duration_ms,
  max(duration_nano) / 1000000 AS max_duration_ms
FROM opentelemetry_traces
WHERE span_name IN ('execute_tool', 'claude_code_tool_use')
  AND greptime_timestamp > now() - '24h'::INTERVAL
GROUP BY agent, tool_name
ORDER BY call_count DESC;

Agent Session Timeline with Correlated System Metrics

SELECT
  t.service_name AS agent,
  t.span_name,
  t.greptime_timestamp AS started_at,
  t.duration_nano / 1000000000.0 AS duration_sec,
  m.avg_memory_util
FROM opentelemetry_traces t
LEFT JOIN (
  SELECT
    date_bin('1 minute'::INTERVAL, greptime_timestamp, '1970-01-01T00:00:00'::TIMESTAMP) AS ts,
    avg(value) AS avg_memory_util
  FROM system_memory_utilization
  WHERE greptime_timestamp > now() - '24h'::INTERVAL
  GROUP BY ts
) m ON m.ts = date_bin('1 minute'::INTERVAL, t.greptime_timestamp, '1970-01-01T00:00:00'::TIMESTAMP)
WHERE t.parent_span_id = ''
  AND t.greptime_timestamp > now() - '24h'::INTERVAL
ORDER BY t.greptime_timestamp DESC;

Token Rate Per Agent (RANGE/ALIGN/FILL)

Uses GreptimeDB's native time-series aggregation for bucketed token rates:

SELECT
  service_name AS agent,
  sum(value) RANGE '5m' AS tokens_per_5m
FROM claude_code_token_usage
ALIGN '5m' BY (service_name) FILL 0;

Container Resource Usage vs. Token Throughput

SELECT
  c.container_name,
  avg(c.value) AS avg_cpu_percent,
  tok.total_tokens
FROM container_cpu_percent c
JOIN (
  SELECT
    service_name,
    sum(value) AS total_tokens
  FROM claude_code_token_usage
  WHERE greptime_timestamp > now() - '1h'::INTERVAL
  GROUP BY service_name
) tok ON tok.service_name LIKE '%' || c.container_name || '%'
WHERE c.greptime_timestamp > now() - '1h'::INTERVAL
GROUP BY c.container_name, tok.total_tokens
ORDER BY tok.total_tokens DESC;