CrewAI Observability: How to Monitor Crews in Production
Running a CrewAI crew in development is straightforward. You define a few agents with roles and goals, wire up some tasks, call crew.kickoff(), and read the output. When something breaks you read the console.
Production is different. Multiple agents run in sequence or in parallel. A manager LLM delegates tasks to worker agents. Tools get retried silently. One agent feeds output to the next, and a quality problem anywhere in the chain produces a bad final result — with no indication of where the failure started.
This post covers how to add production observability to CrewAI crews with LumiqTrace: full per-agent traces, delegation spans, cost breakdown per task, eval templates for output quality, and automated anomaly detection. For background on what agent observability captures and why it matters, see What is AI Agent Observability.
Why CrewAI is hard to debug in production
These are the specific failure modes engineers hit when running CrewAI in production:
You don't know which agent in the crew got stuck. A crew run times out or returns a degraded result. Was it the researcher, the writer, or the tool call? The crew.kickoff() call blocks and returns one result — there is no native signal about which agent was the bottleneck.
Task retries are invisible. CrewAI retries failed tasks internally. If a tool call returned stale data and the agent retried it twice before succeeding, you'll never know from the final output — but you paid for three LLM calls.
Hierarchical delegation failures are opaque. When you use Process.hierarchical, a manager LLM decides which worker agent to assign each task to. If the manager delegates to the wrong agent, or writes ambiguous instructions that cause the worker to hallucinate, there is no record of what delegation decision was made.
You can't see cost per agent. A crew run costs $0.089. Is that the researcher burning $0.051 on web search synthesis, or the writer making four revision passes? Without per-agent cost data, you can't optimize.
Output quality is binary. The crew either returns something or it doesn't. There is no built-in signal for whether the output was faithful to the source material, whether each agent met its expected_output spec, or whether the task instructions were followed precisely.
Setup: 3 lines
import lumiqtrace
from lumiqtrace.integrations import LumiqtraceCrewAIListener
lumiqtrace.init(api_key="YOUR_API_KEY")
LumiqtraceCrewAIListener() # instantiating registers it — no other args needed
// Node.js / TypeScript (for services that interact with CrewAI)
import { lumiqtrace } from "@lumiqtrace/sdk";
lumiqtrace.init({ apiKey: process.env.LT_KEY });
LumiqTrace auto-patches all LLM provider calls on init. LumiqtraceCrewAIListener() hooks into CrewAI's event system — agent identity, task spans, and delegation are captured automatically from here. No decorators, no manual span creation, no changes to your agent or task definitions.
Here's the full crew code for reference — none of this changes:
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate information about the topic",
    backstory="You are an expert researcher with 10 years of experience.",
    tools=[SerperDevTool()]
)

writer = Agent(
    role="Content Writer",
    goal="Write clear, accurate content based on research",
    backstory="You write technical content for engineering audiences."
)

research_task = Task(
    description="Research the latest developments in {topic}",
    expected_output="A factual summary with 5 key findings",
    agent=researcher
)

write_task = Task(
    description="Write a technical blog post based on the research",
    expected_output="A 500-word technical post",
    agent=writer
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential
)
To group traces by job or crew instance, add a context wrapper:
with lumiqtrace.context(crew_id="research_crew", job_id="job_789"):
    result = crew.kickoff(inputs={"topic": user_topic})

await lumiqtrace.withContext({ crewId: "research_crew", jobId: "job_789" }, async () => {
  const result = await crew.kickoff({ inputs: { topic: userTopic } });
});
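For intuition about how a context wrapper like this can tag everything that runs inside the block, here is a minimal standard-library sketch. It is not LumiqTrace's implementation: `trace_context` and `record_span` are hypothetical names, and the only assumption is that the active tags live in a `contextvars.ContextVar`.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical: holds the tags that apply to the code currently executing.
_tags = contextvars.ContextVar("trace_tags", default={})

@contextmanager
def trace_context(**tags):
    """Attach tags to every span recorded inside the with-block."""
    token = _tags.set({**_tags.get(), **tags})
    try:
        yield
    finally:
        _tags.reset(token)  # restore the outer tags, even if the block raised

def record_span(name):
    """Stand-in for an exporter: return the span with the active tags merged in."""
    return {"name": name, **_tags.get()}

with trace_context(crew_id="research_crew", job_id="job_789"):
    inside = record_span("Agent: Research Analyst")

outside = record_span("unrelated call")
```

Spans recorded inside the block carry `crew_id` and `job_id`; spans recorded after it do not, which is what makes filtering by job possible later.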
What you see in the dashboard
After your first crew run, LumiqTrace renders an agentic trace that shows the full execution tree. Every span carries agent identity — you know immediately which agent generated which span. For hierarchical crews, manager delegations are first-class spans with the full instruction and the selected worker.
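Here is what the trace looks like for the research + writing crew above. The manager delegation span is what you would see when the crew runs with Process.hierarchical; a Process.sequential run has no manager step: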
Crew: Research + Writing [4.2s, $0.089]
├── Manager LLM: task delegation [220ms, $0.008]
├── Agent: Research Analyst [2.1s, $0.051]
│ ├── Tool: SerperDevTool [380ms]
│ ├── Tool: SerperDevTool [290ms] ← retry after first returned stale results
│ └── LLM: gpt-4o synthesis [1.4s, $0.047]
└── Agent: Content Writer [1.9s, $0.038]
├── LLM: gpt-4o draft [1.1s, $0.024]
├── LLM: gpt-4o-mini quality check [0.4s, $0.008]
└── LLM: gpt-4o revision [0.4s, $0.006]
Each span includes: agent role, task description, input/output content, model used, token counts, latency, and cost. Delegations in hierarchical crews show the manager's instructions verbatim alongside the worker's response.
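To make the tree concrete, the sketch below models the example trace as nested spans and answers two of the questions raised earlier: which step was the bottleneck, and did a tool get retried. The `Span` class and both helper functions are hypothetical and written only for illustration; the latencies are copied from the trace above.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    latency_ms: int
    children: list["Span"] = field(default_factory=list)

def slowest_leaf(span):
    """Return the leaf span with the highest latency in the subtree."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children), key=lambda s: s.latency_ms)

def repeated_calls(agent):
    """Count direct children that repeat the previous sibling's name (a likely retry)."""
    names = [c.name for c in agent.children]
    return sum(1 for prev, cur in zip(names, names[1:]) if prev == cur)

researcher = Span("Agent: Research Analyst", 2100, [
    Span("Tool: SerperDevTool", 380),
    Span("Tool: SerperDevTool", 290),
    Span("LLM: gpt-4o synthesis", 1400),
])
writer = Span("Agent: Content Writer", 1900, [
    Span("LLM: gpt-4o draft", 1100),
    Span("LLM: gpt-4o-mini quality check", 400),
    Span("LLM: gpt-4o revision", 400),
])
crew = Span("Crew: Research + Writing", 4200, [
    Span("Manager LLM: task delegation", 220), researcher, writer,
])

bottleneck = slowest_leaf(crew)       # the gpt-4o synthesis call
retries = repeated_calls(researcher)  # one repeated SerperDevTool call
```

Matching on consecutive identical names is a deliberately naive retry heuristic; a real trace would more likely carry an explicit retry attribute.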
Debugging a crew failure step by step
Suppose a crew run returns a blog post that is factually wrong. Here is how to work through it in LumiqTrace:
Step 1: Open the trace and check the Research Analyst span. Expand the agent span and look at the tool call outputs. In this example, the first SerperDevTool call returned stale data — you can see the raw response in the span. The retry returned fresher results, but the agent used both in its synthesis prompt. That's the source of the factual error.
Step 2: Check the synthesis LLM call input. The LLM input for the gpt-4o synthesis span shows the agent's prompt with both tool responses concatenated. The older result contradicts the newer one, and the model picked the wrong one.
Step 3: Check the Content Writer's input. The writer's LLM: gpt-4o draft span shows the exact string passed from the researcher as context. Because the researcher's output was wrong, the writer faithfully reproduced the error.
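Step 4: Check latency for regression signals. The researcher took 2.1s. Filter to the last 50 runs: if typical researcher latency is 1.4s, this run is about 700ms slower than usual. The trace attributes 670ms of the run to the two SerperDevTool calls (380ms plus the 290ms retry), and a retry like this points to a data-freshness problem.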
The fix: add a tool result validation step to the researcher's task, or deduplicate tool results before synthesis. Both are visible because the trace shows every intermediate step.
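The second option, deduplicating tool results before synthesis, could look roughly like the sketch below. It assumes each tool result has already been normalized into a dict with a `topic` key and a `published` date; those field names and the sample data are hypothetical, and real SerperDevTool output would need to be mapped into this shape first.

```python
from datetime import date

def keep_freshest(results):
    """Keep only the newest result per topic so stale data never reaches the synthesis prompt."""
    freshest = {}
    for r in results:
        best = freshest.get(r["topic"])
        if best is None or r["published"] > best["published"]:
            freshest[r["topic"]] = r
    return list(freshest.values())

# Hypothetical, already-normalized tool results: the first call is stale, the retry is fresh.
tool_results = [
    {"topic": "release", "published": date(2023, 3, 1), "text": "older finding"},
    {"topic": "release", "published": date(2024, 6, 1), "text": "newer finding"},
    {"topic": "pricing", "published": date(2024, 5, 1), "text": "pricing finding"},
]

deduped = keep_freshest(tool_results)
```

With this in place the synthesis prompt sees one result per topic, so the model is never asked to choose between contradictory versions of the same fact.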
Cost breakdown per agent and per task
LumiqTrace rolls up token cost at every LLM call and attributes it to the agent that made it. For the crew above:
- Manager LLM delegation: $0.008 (9% of run cost)
- Research Analyst: $0.051 (57% of run cost — gpt-4o synthesis is the driver)
- Content Writer: $0.038 (43% of run cost — three LLM passes for draft, check, and revision)
At scale, a crew that runs 10,000 times per month spends $890/month at this rate. LumiqTrace's AI cost optimizer surfaces which agent model swaps would preserve output quality while reducing cost. In this case, moving the writer's quality check from gpt-4o to gpt-4o-mini (already done here) saves $0.016 per run — $160/month at volume.
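The arithmetic behind these figures is easy to reproduce. The sketch below rolls up the Content Writer's three LLM calls from the example trace and projects the per-run numbers to a month at 10,000 runs. It uses `Decimal` to avoid floating-point rounding on small dollar amounts; the `monthly` helper is a hypothetical name.

```python
from decimal import Decimal

# Per-call costs for the Content Writer, copied from the example trace.
writer_calls = {
    "gpt-4o draft": Decimal("0.024"),
    "gpt-4o-mini quality check": Decimal("0.008"),
    "gpt-4o revision": Decimal("0.006"),
}
writer_cost = sum(writer_calls.values(), Decimal("0"))

def monthly(per_run, runs_per_month=10_000):
    """Project a per-run dollar figure to a monthly figure."""
    return per_run * runs_per_month

run_cost = Decimal("0.089")
monthly_spend = monthly(run_cost)            # dollars per month at the quoted run cost
monthly_saving = monthly(Decimal("0.016"))   # dollars per month from the quality-check swap
writer_share = (writer_cost / run_cost * 100).quantize(Decimal("1"))  # whole percent
```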
You can filter the cost breakdown by time window, by job_id, and by task to understand how cost changes as your input complexity changes. For a full feature comparison of CrewAI-compatible observability tools, see the AI agent observability tools overview.
Evaluating crew output quality
LumiqTrace ships 12 built-in eval templates. For CrewAI crews, three are most relevant:
Faithfulness: Did the writer's output stay grounded in the researcher's findings, or did the model introduce unsupported claims? LumiqTrace compares the writer's output span against the researcher's output span and scores fidelity.
Task completion: Did each agent produce output that matches the expected_output field from its task definition? This catches agents that return partial results or change format under load.
Instruction following: Did the agent's output follow the explicit instructions in the task description? Useful for catching prompt drift when task descriptions are parameterized with user input.
Evals run automatically on each trace and appear as a score column in the trace list. You can set threshold alerts — for example, flag any crew run where faithfulness drops below 0.8 — so regressions surface before users report them.
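A threshold alert of this kind reduces to a filter over scored runs. The sketch below assumes each run is a dict with a `run_id` and a `scores` mapping; that record shape and the sample values are hypothetical, not LumiqTrace's export format.

```python
def flag_low_scores(runs, metric="faithfulness", threshold=0.8):
    """Return the IDs of runs that have a score for `metric` below `threshold`."""
    return [
        r["run_id"]
        for r in runs
        if metric in r["scores"] and r["scores"][metric] < threshold
    ]

# Hypothetical scored runs; run_3 was never scored for faithfulness.
runs = [
    {"run_id": "run_1", "scores": {"faithfulness": 0.93}},
    {"run_id": "run_2", "scores": {"faithfulness": 0.71}},
    {"run_id": "run_3", "scores": {"task_completion": 0.90}},
]

flagged = flag_low_scores(runs)
```

Runs without a score for the metric are skipped instead of flagged, so a missing eval does not page anyone; whether that is the right default depends on how strictly you want to treat unscored runs.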
Teams evaluating LangSmith for CrewAI monitoring should read our LangSmith alternatives guide — LangSmith's instrumentation doesn't extend cleanly to non-LangChain frameworks. For a side-by-side comparison of all leading observability tools including CrewAI support, see the LangSmith vs Langfuse vs LumiqTrace comparison.
LumiqPilot for crew ops
LumiqPilot is LumiqTrace's AI ops assistant embedded in the dashboard. It has three capabilities:
Deep data analysis. Ask questions against your live trace data in plain language: "Which agent in my research crew has the highest p95 latency over the last 7 days?" or "What is the average cost per crew run when the input topic is longer than 50 words?" LumiqPilot queries your trace store and returns a specific answer with the supporting data.
Instant action from insight. When LumiqPilot surfaces a finding, you can act on it in one click. If it detects that your researcher agent's latency spiked 40% after a model version change, you can create an alert or switch the agent's model directly from the chat — no dashboard navigation required.
Proactive auto-remediation (Scale plan). LumiqPilot monitors your crew behavior continuously and takes pre-approved actions without waiting for you to ask. If the researcher agent starts returning tool errors at a rate above your configured threshold, LumiqPilot can automatically fall back to a backup tool, page on-call, or adjust retry settings — then show you what it did and why.
Anomaly detection for crew behavior
LumiqTrace's AI anomaly detection builds a behavioral baseline for each agent in your crew across latency, cost, token usage, and eval scores. It then flags statistically significant deviations.
Examples of what it catches in CrewAI crews:
- The Research Analyst's tool retry rate doubles — usually indicates a data source degradation before it causes full failures
- The Content Writer's output token count spikes 3x — often a prompt injection via user-supplied topic input
- Crew run cost increases 60% after a CrewAI version upgrade — a dependency change introduced extra LLM calls that weren't visible in the output
Anomalies surface in the dashboard and can trigger alerts to Slack, PagerDuty, or webhooks. Detection runs on every trace with no configuration required.
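The post does not say which statistical test the detector uses. As one common approach, and not a description of LumiqTrace's internals, a z-score check against a per-agent baseline can be sketched as follows. The function name, the threshold of 3.0, and the sample latencies are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(baseline, value, z_threshold=3.0):
    """Flag `value` if it lies more than `z_threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Hypothetical Research Analyst latencies (seconds) from recent runs.
researcher_latency = [1.4, 1.3, 1.5, 1.4, 1.6, 1.3, 1.4, 1.5]

slow_run = is_anomalous(researcher_latency, 2.1)    # the 2.1s run from the example trace
normal_run = is_anomalous(researcher_latency, 1.5)  # a run within the usual range
```

The same check applies unchanged to cost, token counts, or eval scores, since it only needs a list of past values and a new observation.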
Frequently Asked Questions
Does LumiqTrace require changes to my CrewAI agent or task definitions?
No. LumiqTrace auto-discovers all agents, tasks, tools, and LLM calls through instrumentation. Your existing crew code stays unchanged — add lumiqtrace.init() and LumiqtraceCrewAIListener() at startup.
Does LumiqTrace work with both sequential and hierarchical CrewAI processes?
Yes. For sequential crews, each task handoff is a first-class span. For hierarchical crews, manager LLM delegations are recorded as explicit spans with full context — which worker received the delegation, what instructions were sent, and what was returned.
Can I track cost per agent in a CrewAI crew?
Yes. LumiqTrace records token usage and cost at every LLM call, then rolls it up per agent, per task, and per crew run. You can see exactly which agent is driving cost and compare across runs.
How do I evaluate CrewAI output quality?
LumiqTrace ships 12 built-in eval templates. For CrewAI, the most relevant are faithfulness (did the writer's output stay grounded in the researcher's findings?), task completion (did the agent meet the expected_output spec?), and instruction following.
Start monitoring your CrewAI crew
Add three lines to your crew code, run a kickoff, and your first traces appear in the dashboard within seconds.
LumiqTrace Free tier: $0, 10,000 traces per month, no credit card required. Evals, cost breakdown, and anomaly detection are included on all plans.