
What to Log in an AI Agent System If You Want Debugging to Be Possible Later

2026-04-29 • Logging and observability guide • Butler

If you cannot reconstruct one full agent run later, you do not really have AI observability. Here is the practical logging baseline that makes debugging and postmortems possible.

[Image: The Butler reviewing a detailed operational logbook, representing AI agent traces, audit trails, and incident review]

Most teams start AI agent logging the same way they start normal app logging. They log a few requests, a few errors, maybe the final output, then assume they can figure out the rest later.

That breaks fast.

In agent systems, the incident is rarely one model call. It is the chain of model calls, retrieval steps, tool actions, retries, approvals, and fallbacks. If you cannot reconstruct that chain later, you do not really have post-incident review. You have expensive noise.

The practical standard is simpler than it sounds. Log enough to rebuild one agent run from start to finish, correlate every important step, and understand exactly where the failure boundary was. If you need the broader foundation first, start with What Is an AI Agent in 2026?.

Normal app logs are not enough for agents

A normal service request usually has a cleaner shape. Input comes in, code runs, output goes back.

An agent run is messier. One request may fan out into multiple model calls, a retrieval step against changing data, one or more tool executions, a policy check, a human approval pause, a retry, and a fallback. The final bad answer may be caused by any one of those layers, or by the interaction between them.

That is why agent logging should be trace-first, not line-log-first.

If your team cannot open one trace and see the full decision path, you do not have AI observability yet. This is also why practical ops work tends to overlap with pieces like The 7 Failure Checks Every AI Agent Workflow Should Run Before Production. Logging is not separate from reliability. It is how reliability becomes reviewable.
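To make trace-first concrete, here is a minimal sketch using the OpenTelemetry Python API. The span names and attributes are illustrative assumptions, not an agreed schema, and without an SDK configured the calls are harmless no-ops:

```python
# Minimal trace-first sketch using the OpenTelemetry API.
# Without an SDK and exporter configured, these calls are no-ops,
# so the sketch is safe to run as-is.
from opentelemetry import trace

tracer = trace.get_tracer("agent.observability")

def handle_request(user_query: str) -> str:
    # One top-level span per agent run; every step nests inside it.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.version", "1.4.2")  # illustrative

        with tracer.start_as_current_span("agent.retrieval") as retr_span:
            docs = ["doc-123"]  # stand-in for a real retriever call
            retr_span.set_attribute("retrieval.doc_ids", docs)

        with tracer.start_as_current_span("agent.model_call") as model_span:
            model_span.set_attribute("gen_ai.request.model", "example-model")
            answer = "..."  # stand-in for a real completion call

        with tracer.start_as_current_span("agent.tool_call") as tool_span:
            tool_span.set_attribute("tool.name", "create_ticket")

        return answer
```

The point is the shape, not the vendor: every step becomes a child span of one run, so the full decision path opens as a single trace.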

The minimum useful logging unit is one reconstructable run

Think in terms of a run, not a message.

Every run should have a traceable top-level record with linked spans for:

- each model call
- each retrieval step against the corpus
- each tool execution
- policy checks and approval pauses
- retries and fallback paths

That one run should answer the questions operators actually ask during incidents:

- Who or what triggered the run?
- Which agent build, prompt version, and model actually ran?
- Which tools fired, and what did they change?
- What context was retrieved and passed downstream?
- Where exactly did the run fail, retry, or hand off to a human?

If those answers live across five systems with no shared trace ID, your logging strategy is already failing.

The 7 things to log on every run

1. Stable identifiers and versions

Every run needs durable IDs that survive across services.

Log trace_id, run_id, span_id, parent span, session or conversation ID, user or tenant identifier in safe form, plus the agent name, agent ID, and agent version. Also log prompt template version, system instruction version, tool version, workflow step name, environment, and region.

Postmortems fail quickly when nobody can answer which prompt, model alias, or agent build actually ran.
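A minimal sketch of that identity envelope as a single structure, attached to every span and log line in the run. The field names and defaults are illustrative, not a standard:

```python
# Sketch of the identifier/version envelope attached to every log record.
# Field names are illustrative assumptions, not a standard schema.
import uuid
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class RunIdentity:
    trace_id: str
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str | None = None
    tenant_id: str | None = None            # pseudonymous form, never raw PII
    agent_name: str = "support-agent"       # illustrative
    agent_version: str = "1.4.2"
    prompt_template_version: str = "welcome-v7"
    system_instruction_version: str = "sys-v3"
    environment: str = "prod"
    region: str = "eu-west-1"

identity = RunIdentity(trace_id=str(uuid.uuid4()))
# Merge asdict(identity) into every span and log line in this run.
print(asdict(identity))
```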

2. Model request and response metadata

You should always log the operational metadata around model calls.

That includes provider, endpoint, requested model, actual response model, temperature, top_p, max_tokens, seed when used, and whether streaming was enabled. Also capture finish reason, latency, and time to first chunk.

A bad answer is not always a reasoning problem. Sometimes it is a timeout, a token ceiling, a streaming break, or a hidden model-route change.
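Here is a hedged sketch of recording that metadata as one structured log line per model call. The keys mirror the fields above; the response_meta shape is an assumption, not any vendor's API:

```python
# Sketch: one structured log line per model call, capturing operational
# metadata. The response_meta dict is a stand-in for whatever your
# provider client returns.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.model")

def log_model_call(trace_id: str, response_meta: dict, started: float) -> None:
    record = {
        "trace_id": trace_id,
        "provider": response_meta.get("provider"),
        "requested_model": response_meta.get("requested_model"),
        "response_model": response_meta.get("response_model"),  # may differ!
        "temperature": response_meta.get("temperature"),
        "max_tokens": response_meta.get("max_tokens"),
        "streaming": response_meta.get("streaming", False),
        "finish_reason": response_meta.get("finish_reason"),
        "latency_ms": round((time.monotonic() - started) * 1000),
    }
    log.info(json.dumps(record))

started = time.monotonic()
# ... call the model here ...
log_model_call("trace-abc", {"provider": "example",
                             "requested_model": "m-large",
                             "response_model": "m-large-0412",
                             "finish_reason": "length"}, started)
```

Note the requested model and response model are logged separately; a hidden route change between them is exactly the kind of thing this layer catches.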

3. Token, cost, and performance data

AI incidents are often performance incidents.

Log input tokens, output tokens, reasoning tokens when exposed, cache reads and writes when available, estimated cost, retry counts, and total run duration. If you track budgets or escalation paths, tie those records into the same run. That is the operational companion to How to Set Budgets, Rate Limits, and Escalation Rules for AI Agent Workflows.

Without this layer, teams misdiagnose expensive or slow runs as model quality issues.
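A small sketch of turning raw usage numbers into a cost-aware run record. The per-token prices are placeholders, not real vendor pricing:

```python
# Sketch: derive an estimated cost and attach it to the run record.
# Prices below are placeholders; substitute your provider's actual rates.
def usage_record(run_id: str, usage: dict,
                 in_price_per_1k: float = 0.003,
                 out_price_per_1k: float = 0.015) -> dict:
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    return {
        "run_id": run_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_tokens": usage.get("cache_read_tokens", 0),
        "retries": usage.get("retries", 0),
        "estimated_cost_usd": round(
            input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k, 6),
    }

print(usage_record("run-42", {"input_tokens": 1200, "output_tokens": 350}))
```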

4. Tool calls as first-class records

If tool calls can change the world, they need better logs than your model completions do.

For each tool call, log the tool name and version, calling span, structured arguments with redaction where needed, start and end time, approval state, result code, retry history, and side-effect target such as recipient ID, record ID, ticket ID, or order ID.

Do not bury a database write, outbound message, or destructive action inside a vague debug string. Tools are where auditability becomes real.
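Here is one way to shape a tool call as a first-class, redacted record. The field names, redaction rule, and tool are all illustrative:

```python
# Sketch of a first-class tool-call record, with redaction applied to
# arguments before anything is written. Names are illustrative.
import json
import time
from dataclasses import asdict, dataclass

REDACTED_KEYS = {"email", "phone", "address"}  # assumption: your PII keys

def redact(args: dict) -> dict:
    return {k: ("[REDACTED]" if k in REDACTED_KEYS else v)
            for k, v in args.items()}

@dataclass
class ToolCallRecord:
    trace_id: str
    span_id: str
    tool_name: str
    tool_version: str
    arguments: dict           # already redacted
    approval_state: str       # e.g. "auto", "approved", "rejected"
    result_code: str          # low-cardinality: "ok", "timeout", ...
    side_effect_target: str   # ticket ID, order ID, recipient ID
    started_at: float
    ended_at: float

rec = ToolCallRecord(
    trace_id="trace-abc", span_id="span-7",
    tool_name="create_ticket", tool_version="2.1.0",
    arguments=redact({"subject": "refund", "email": "a@b.c"}),
    approval_state="auto", result_code="ok",
    side_effect_target="TICKET-9031",
    started_at=time.time(), ended_at=time.time(),
)
print(json.dumps(asdict(rec)))
```

The side-effect target is the field incident reviewers reach for first: it is what lets you answer "what did the agent actually touch" without grepping debug strings.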

5. Retrieval and context assembly

If the agent uses RAG, log the retrieval step separately.

Capture the retriever name and version, query or query hash depending on sensitivity, corpus or index version, top-k settings, returned document IDs or chunk IDs, relevance scores when available, and which citations were passed downstream.

This matters because a confident wrong answer may be caused by bad retrieval, stale indexing, or missing context rather than the model itself.
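A sketch of a retrieval span record. Hashing the query keeps runs comparable without storing sensitive text; the retriever name and field layout are assumptions:

```python
# Sketch of a retrieval span record. The query is hashed so runs can be
# compared without storing sensitive text; field names are illustrative.
import hashlib

def retrieval_record(trace_id: str, query: str, index_version: str,
                     top_k: int, results: list[dict]) -> dict:
    return {
        "trace_id": trace_id,
        "retriever": "hybrid-bm25-vector",   # illustrative name
        "query_sha256": hashlib.sha256(query.encode()).hexdigest(),
        "index_version": index_version,
        "top_k": top_k,
        "chunk_ids": [r["chunk_id"] for r in results],
        "scores": [r.get("score") for r in results],
        "cited_downstream": [r["chunk_id"] for r in results if r.get("cited")],
    }

print(retrieval_record("trace-abc", "refund policy for EU orders",
                       "kb-2026-04-27", 5,
                       [{"chunk_id": "kb-101", "score": 0.91, "cited": True},
                        {"chunk_id": "kb-322", "score": 0.77}]))
```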

6. Approvals, handoffs, and policy decisions

A lot of important operational behavior happens outside the model call.

Log whether the run required approval, who approved or rejected it, what policy gate triggered the pause, whether a human took over, and where the handoff happened. These records should line up with the same trace as the rest of the run, not sit in a separate ticketing or chat silo.

If your workflow uses selective human review, this should connect directly with your human-in-the-loop approval patterns for AI operations and the practical handoff choices in The Best Human Handoff Points in an AI Workflow.
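A minimal sketch of an approval event that lives on the same trace as the run, rather than in a ticketing silo. The policy gate and field names are invented for illustration:

```python
# Sketch of an approval/handoff event sharing the run's trace ID instead
# of living only in a ticketing or chat tool. Fields are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ApprovalEvent:
    trace_id: str
    policy_gate: str          # which rule paused the run
    decision: str             # "approved" | "rejected" | "escalated"
    decided_by: str           # reviewer ID, or "auto" for policy decisions
    human_takeover: bool
    handoff_step: str | None  # workflow step where a human took over
    decided_at: str

event = ApprovalEvent(
    trace_id="trace-abc",
    policy_gate="outbound_email_over_100_recipients",  # hypothetical gate
    decision="approved",
    decided_by="reviewer-17",
    human_takeover=False,
    handoff_step=None,
    decided_at=datetime.now(timezone.utc).isoformat(),
)
print(event)
```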

7. Failures, exceptions, and fallback boundaries

Do not log every failure as "LLM error" and call it done.

You need a low-cardinality error type, provider error code or HTTP status, retry and backoff history, last successful step, fallback invoked or not, and whether human escalation occurred. Exception messages and stack traces can be useful, but only under policy because they may contain secrets or user data.

The real value is classification. Teams need to distinguish model failure from tool failure, retrieval failure, orchestration failure, timeout, rate limit, policy block, and operator interruption.
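Here is what a low-cardinality failure taxonomy can look like, matching the categories above. The enum values are suggestions, not a standard:

```python
# Sketch of a low-cardinality failure taxonomy matching the categories
# named above. Values are suggestions, not a standard.
from enum import Enum

class AgentErrorType(str, Enum):
    MODEL_FAILURE = "model_failure"
    TOOL_FAILURE = "tool_failure"
    RETRIEVAL_FAILURE = "retrieval_failure"
    ORCHESTRATION_FAILURE = "orchestration_failure"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    POLICY_BLOCK = "policy_block"
    OPERATOR_INTERRUPT = "operator_interrupt"

def failure_record(run_id: str, err: AgentErrorType, status: int | None,
                   retries: int, last_ok_step: str, fallback: bool) -> dict:
    return {"run_id": run_id, "error_type": err.value, "http_status": status,
            "retries": retries, "last_successful_step": last_ok_step,
            "fallback_invoked": fallback}

print(failure_record("run-42", AgentErrorType.RATE_LIMIT, 429, 3,
                     "retrieval", True))
```

Keeping the type list short is deliberate: low cardinality is what makes dashboards and alert rules possible on top of these records.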

What should be content-logged versus metadata-only

This is where teams often get reckless.

The goal is not to log everything. It is to log enough to reconstruct what mattered without turning your observability stack into a data leak.

Always log

- run, span, and trace identifiers, plus agent, prompt, and model versions
- model routing, finish reasons, and latency metadata
- token counts, retries, and estimated cost
- tool names, result codes, and side-effect targets
- error classifications and fallback boundaries

Log conditionally

- raw prompts and completions
- retrieved document content
- tool arguments that may contain user data
- exception messages and stack traces

The safer default

Use metadata plus redacted content by default. Allow raw payload capture only for explicit debug sessions, sampled incidents, or tightly governed short-retention paths. When full content is unnecessary, store hashes or fingerprints so teams can compare runs without copying sensitive material everywhere.

Log content by policy tier, not by accident.
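A short sketch of tier-based capture: metadata always, fingerprint plus redacted preview by default, raw payloads only under an explicit debug tier. The tier names and truncation rule are assumptions:

```python
# Sketch of tiered content capture: metadata always, redacted content by
# default, raw payloads only under an explicit debug tier. Tier names and
# the truncation rule are assumptions for illustration.
import hashlib

def capture_content(text: str, tier: str = "default") -> dict:
    fingerprint = hashlib.sha256(text.encode()).hexdigest()
    if tier == "debug":        # explicit, short-retention debug sessions only
        return {"sha256": fingerprint, "content": text}
    if tier == "default":      # comparable hash plus a truncated preview
        return {"sha256": fingerprint,
                "content_preview": text[:80] + "...[truncated]"}
    return {"sha256": fingerprint}  # metadata-only tier

prompt = "Summarize the refund request from customer 4417..."
print(capture_content(prompt))            # default: hash + preview
print(capture_content(prompt, "debug"))   # raw capture, gated by policy
```

The hash is what makes the metadata-only tier still useful: two runs with identical prompts can be matched without either payload ever leaving the governed path.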

Common mistakes that make incident review impossible

A few failure patterns show up constantly:

- no shared trace ID, so model calls, tool logs, and approvals cannot be joined into one run
- tool side effects buried in vague debug strings instead of structured records
- every failure logged as a generic "LLM error" with no classification
- missing prompt, model, and agent versions, so nobody knows what actually ran
- approval and handoff records stranded in a separate ticketing or chat silo
- raw payloads captured everywhere by accident instead of by policy tier

Observability without correlation IDs is just expensive noise.

A copy-paste production checklist

Use this as the minimum bar before you call an agent system debuggable:

- every run has a trace ID linking all model calls, retrieval steps, tool calls, and approvals
- agent, prompt, and model versions are recorded on every run
- token counts, latency, retries, and estimated cost are attached to the same trace
- tool calls are first-class records with arguments, result codes, and side-effect targets
- retrieval logs capture index version and which chunks were passed downstream
- approvals, handoffs, and policy gates appear in the same trace as the run
- failures carry a low-cardinality error type and the last successful step
- content capture follows an explicit policy tier, not ad-hoc raw logging

That is the real threshold. Not perfect telemetry. Reconstructable operations.

Bottom line

Every agent run should be reconstructable, explainable, and reviewable.

If your team can answer who, what, which version, which tool, which context, and what failed from one trace, you are in decent shape. If not, debugging will turn into guesswork the first time an agent does something expensive, risky, or weird.

That is the practical standard worth building toward.

AI Disclosure

This article was researched and drafted with AI assistance, then edited and structured for publication by a human. Logging policy, retention, and redaction decisions should still be set by the team operating the system.