A trace is one of the more useful things you can capture from an LLM app, and many teams store it in a format they cannot move later. If your AI tracing uses its own private format, the spans sit in a silo, a backend change means re-instrumenting, and the AI traces never join the HTTP and database traces under one request. Making tracing OpenTelemetry-native from the start avoids that.
It helps to start with what traces give you that logs do not. The failures that matter most in an LLM app often do not throw an error. An agent returns a confident but wrong answer, nothing errors, and the log says the request finished in 4.2 seconds with a 200.
A trace shows the same request as a tree: an LLM call that planned three tool calls, a tool call whose return was an empty array, and a second LLM call that took that empty array and answered anyway. The trace shows the problem because it records the input and output of every step, not just that the step ran.
A log line for a tool call is a timestamp and a status. A tool span is the tool name, the arguments passed, the return payload, and the latency, all queryable, so you can filter to every empty tool return across a week of traffic.
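To make "queryable" concrete, here is a minimal sketch in plain Python. The `ToolSpan` record and the `empty_returns` helper are hypothetical names invented for illustration, not part of any tracing library; a real backend would run the equivalent filter as a query over stored spans.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolSpan:
    """Hypothetical flat record for one tool call (not a real tracing API)."""
    trace_id: str
    tool_name: str
    arguments: dict[str, Any]
    output: Any
    latency_ms: float


def is_empty(value: Any) -> bool:
    """True for None and for zero-length containers or strings."""
    return value is None or (hasattr(value, "__len__") and len(value) == 0)


def empty_returns(spans: list[ToolSpan]) -> list[ToolSpan]:
    """Keep only the spans whose tool returned nothing usable."""
    return [s for s in spans if is_empty(s.output)]


# Illustrative data only.
spans = [
    ToolSpan("t1", "search_orders", {"customer_id": 42}, [], 118.0),
    ToolSpan("t2", "search_orders", {"customer_id": 7}, [{"order": "A-1"}], 95.5),
    ToolSpan("t3", "get_weather", {"city": "Oslo"}, None, 230.2),
]

for span in empty_returns(spans):
    print(span.trace_id, span.tool_name, span.arguments)
```

A log line with only a timestamp and a status could not answer this question, because the return payload was never recorded.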
That only works if the tracing layer models what an agent actually does. A generic OpenTelemetry install knows about HTTP requests and database queries, but it has no concept of an LLM call, a retrieval step, a tool invocation, or an agent that loops. A tracing layer for AI apps needs five span kinds:
- LLM spans: model, provider, input and output, token counts, temperature.
- Retriever spans: the query, the retrieved chunks, the embedding model, top-k. A wrong RAG answer usually comes from retrieval, not the model.
- Tool spans: the tool name, the JSON arguments, the return payload. Most silent agent failures show up here.
- Agent spans: the parent wrapping one agent's turn, so a multi-agent system shows which agent did what.
- Chain spans: one per multi-step sequence, giving the per-step view that a single root span buries.
When an answer is wrong, you read the tree from the bottom: was the final LLM span ungrounded, did a tool return something empty, did the retriever pull the wrong chunks. The span kinds turn "the agent is broken" into a specific failing node.
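Reading the tree from the bottom can be sketched as a post-order walk. This is a plain-Python model under stated assumptions: `SpanKind`, `Span`, `bottom_up`, and `first_suspect` are hypothetical names, and "empty output" is only one of the checks you would run in practice.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Iterator, Optional


class SpanKind(Enum):
    LLM = "llm"
    RETRIEVER = "retriever"
    TOOL = "tool"
    AGENT = "agent"
    CHAIN = "chain"


@dataclass
class Span:
    """Hypothetical in-memory span; children are in execution order."""
    name: str
    kind: SpanKind
    output: Any = None
    children: list["Span"] = field(default_factory=list)


def bottom_up(span: Span) -> Iterator[Span]:
    """Yield the latest child first and every parent after its children."""
    for child in reversed(span.children):
        yield from bottom_up(child)
    yield span


def first_suspect(root: Span) -> Optional[Span]:
    """First tool or retriever span, reading bottom-up, with an empty output."""
    for span in bottom_up(root):
        if span.kind in (SpanKind.TOOL, SpanKind.RETRIEVER) and not span.output:
            return span
    return None


# The request described earlier: plan, empty tool return, answer anyway.
root = Span("support_agent", SpanKind.AGENT, children=[
    Span("plan", SpanKind.LLM, output="call search_orders"),
    Span("search_orders", SpanKind.TOOL, output=[]),
    Span("answer", SpanKind.LLM, output="Your order shipped yesterday."),
])

suspect = first_suspect(root)
print(suspect.name if suspect else "no empty tool or retriever output")
```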
The thing worth getting right early is the trace format itself. A tracing tool tied to a single AI-specific dialect works until the rest of your stack runs on OpenTelemetry and the AI spans cannot join it. OpenTelemetry-native tracing keeps the spans portable, so a backend change moves your data with you instead of leaving it behind. It puts your LLM and tool spans under the same trace ID as the HTTP and database spans your services already emit, so a slow request is one tree instead of two dashboards. And it exports over OTLP, the wire format your collector already handles, with no separate ingest path.
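The shared trace ID is carried by context propagation. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, whose format is `version-traceid-spanid-flags`. The sketch below models only that header with the standard library; the helper names are hypothetical, and in a real service the OpenTelemetry SDK generates and parses this header for you.

```python
import re
import secrets

# W3C Trace Context, version 00: 32 hex trace ID, 16 hex span ID, 2 hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace, as the HTTP entry point of a request would."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def trace_id_of(header: str) -> str:
    """Extract the trace ID, rejecting malformed headers."""
    match = TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return match.group(1)


def child_traceparent(parent: str) -> str:
    """Derive a child span's header: same trace ID and flags, new span ID."""
    match = TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


http_header = new_traceparent()               # emitted by the web framework
llm_header = child_traceparent(http_header)   # emitted by the LLM span

# Both spans land in one tree because the trace ID is identical.
print(trace_id_of(http_header) == trace_id_of(llm_header))
```

A tracing layer with a private format has no such shared ID to join on, which is why its spans end up in a second dashboard.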
The library I use for this is traceAI: Apache 2.0, built on the OpenTelemetry SDK for Python and TypeScript, with the five span kinds above and 30+ provider and framework integrations, so for most stacks the instrumentation is an install and one register call.
When you compare tracing layers, four questions cover it:
Does it cover the span kinds your app produces?
A chat wrapper needs LLM spans. A RAG app needs retriever spans too. An agent needs tool and agent spans. A multi-step pipeline needs chain spans. A tracing layer that only models LLM calls leaves you blind on the steps that actually fail. traceAI covers all five. Check this first, because no amount of backend polish fixes a missing span kind.
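The coverage check can be written down as a set difference. This is a rough sketch: the `REQUIRED_KINDS` table encodes the mapping in the paragraph above, with the added assumption that every app type also makes LLM calls, and `missing_kinds` is a hypothetical helper.

```python
# Span kinds each app type needs. Including "llm" everywhere is an assumption.
REQUIRED_KINDS: dict[str, set[str]] = {
    "chat": {"llm"},
    "rag": {"llm", "retriever"},
    "agent": {"llm", "tool", "agent"},
    "pipeline": {"llm", "chain"},
}


def missing_kinds(app_types: list[str], supported: set[str]) -> list[str]:
    """Span kinds the app produces that the tracing layer cannot model."""
    needed: set[str] = set()
    for app_type in app_types:
        needed |= REQUIRED_KINDS[app_type]
    return sorted(needed - supported)


# A layer that only models LLM calls, evaluated against a RAG agent.
print(missing_kinds(["rag", "agent"], supported={"llm"}))
```

An empty result means the layer covers what your app emits; anything else is a step you cannot see when it fails.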
Does it integrate with your frameworks?
Count the providers and frameworks in your stack, then check how many the library ships an integration for. Every gap is instrumentation you write and maintain by hand. traceAI's 30+ integrations cover the common Python and TypeScript stacks; if yours is exotic, confirm the hand-emit path works for you.
How deep is the backend?
A trace tree you can look at is the floor. The useful question is what runs on top: can the backend score spans for correctness, cluster failures into named issues, attach guardrails inline, and feed an optimization loop? Tracing tells you what happened. The backend is where you find out whether it was right and what to do about it.
Is the instrumentation open source?
An Apache 2.0 instrumentation layer built on OpenTelemetry means you can self-host, modify, and avoid a per-seat fee on the tracing itself. traceAI is Apache 2.0, and the instrumentation runs independently of the FAGI platform; point the exporter at a self-hosted OTel backend and no span leaves your infrastructure.
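One way to point an exporter at your own collector is through the standard OpenTelemetry SDK environment variables, set before the tracer is configured. The sketch below is a config fragment only: the collector address and service name are hypothetical, and whether a given instrumentation library reads these variables or needs the endpoint passed explicitly is an assumption to check against its documentation.

```python
import os

# Hypothetical in-cluster collector address; replace with your own.
COLLECTOR_ENDPOINT = "http://otel-collector.internal:4318"

# Standard OpenTelemetry SDK variables. setdefault keeps any value that the
# deployment environment has already provided.
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", COLLECTOR_ENDPOINT)
os.environ.setdefault("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")
os.environ.setdefault("OTEL_SERVICE_NAME", "support-agent")  # hypothetical name

print(os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"])
```

Port 4318 is the conventional OTLP over HTTP port; OTLP over gRPC conventionally uses 4317.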
Once the spans land, the trace becomes the input to the rest of the loop. An evaluation layer scores each span for things like groundedness, context relevance, and task completion, and the scores attach back onto the spans so a failing one is filterable. The low-scoring spans get clustered into named failure modes, each with a likely root cause and a fix to try.
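The score-then-cluster step can be sketched in a few lines. Everything here is illustrative: `ScoredSpan`, the 0.5 threshold, and the score values are made up, and grouping by span kind plus worst metric is a crude stand-in for real failure clustering, which would also propose a root cause and a fix.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ScoredSpan:
    """Hypothetical span record with evaluation scores attached (0.0 to 1.0)."""
    span_id: str
    kind: str
    scores: dict[str, float] = field(default_factory=dict)


def failing(spans: list[ScoredSpan], threshold: float = 0.5) -> list[ScoredSpan]:
    """Spans where at least one metric scored below the threshold."""
    return [s for s in spans if s.scores and min(s.scores.values()) < threshold]


def cluster_by_worst_metric(spans: list[ScoredSpan]) -> dict[str, list[str]]:
    """Group failing spans under a label built from kind and lowest metric."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for span in spans:
        worst = min(span.scores, key=span.scores.get)
        clusters[f"{span.kind}:low_{worst}"].append(span.span_id)
    return dict(clusters)


# Illustrative scores only.
spans = [
    ScoredSpan("s1", "llm", {"groundedness": 0.2, "task_completion": 0.9}),
    ScoredSpan("s2", "retriever", {"context_relevance": 0.3}),
    ScoredSpan("s3", "llm", {"groundedness": 0.95, "task_completion": 0.9}),
    ScoredSpan("s4", "llm", {"groundedness": 0.1, "task_completion": 0.7}),
]

clusters = cluster_by_worst_metric(failing(spans))
for label, span_ids in clusters.items():
    print(label, span_ids)
```

Because the scores live on the spans, each cluster links straight back to the traces that produced it.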
From there an optimizer acts on those failures: agent-opt ships six prompt optimizers that run against your eval results, and every run is human-initiated and gated by an evaluator rather than an automatic rewrite. The order is simple: traces produce the spans, evaluation scores them, clustering groups the failures, and optimization addresses them.
The trace format is the part that is hard to change later, so it is the one to settle first. The instrumentation can stay open and OpenTelemetry-native regardless of which backend you end up on, and the backend is a separate choice based on what it does with the trace once it lands.