LangSmith vs Langfuse vs LumiqTrace: Which Agent Observability Tool Is Right for You?

You're building AI agents. They're failing silently in production. Which tool actually tells you why?

If you've evaluated LLM observability tools, you've probably landed on LangSmith or Langfuse, or started hearing about LumiqTrace. All three claim to help you understand what your AI systems are doing. But they're built on fundamentally different assumptions, and if you're running agents rather than simple LLM chains, those assumptions matter more than any feature comparison table.

This post breaks down all three honestly: what each tool is good at, where it falls short, and which fits which team.

Quick Comparison

Feature	LangSmith	Langfuse	LumiqTrace
Primary focus	LangChain tracing & evals	Open-source LLM observability	Agent-native observability
Agentic traces + delegation map	✗	✗	✓
Agent auto-discovery	✗	✗	✓
Built-in eval templates	0 (custom only)	0 (custom only)	12
AI cost optimizer	✗	✗	✓
AI ops assistant	✗	✗	✓ (LumiqPilot)
AI anomaly detection	✗	✗	✓
Setup time	~15 min	~20 min	< 5 min
Free tier	5K traces/mo	50K events/mo	10K traces/mo
Open source	✗	✓	✗
Self-hosted option	✗	✓	Enterprise only

LangSmith

LangSmith is LangChain's observability product. If your stack runs entirely on LangChain, the integration is tight — tracing is near-automatic when you use LangChain's abstractions.

Where LangSmith works well

Deep LangChain integration. If you're using LangGraph for multi-agent orchestration, LangSmith's integration is as close to native as you'll get. Traces attach automatically to chain executions, agent runs, and LangGraph nodes.

Evals within the same ecosystem. LangSmith's dataset management and annotation UI let you collect examples and run evals without leaving the LangChain environment. For teams doing iterative prompt engineering, this workflow is smooth.

Human feedback loop. Annotation workflows let you label good and bad traces, build golden datasets, and run regression tests against them.

Where LangSmith falls short

Lock-in. LangSmith works well if you use LangChain abstractions. If you're using the OpenAI Agents SDK, CrewAI, AutoGen, or your own orchestration layer, instrumentation becomes manual and painful. Switching frameworks later means re-instrumenting from scratch.

No agentic traces. LangSmith records execution but spans don't carry agent identity or delegation context. Multi-agent handoffs aren't first-class — you can't see which agent delegated to which, what was passed, what came back. For a 3-hop agent system this is workable. For a production agent with 15 tool calls, two handoffs, and a planning loop, it falls apart.

No cost optimizer. You can see token counts per run. There's no automated analysis identifying which models could be swapped for cheaper alternatives without quality regression, or which prompts are running unnecessarily long.

Pricing at scale. At 1M+ traces per month, LangSmith moves to custom enterprise pricing. Teams at production scale often find costs exceed expectations.

Who should use LangSmith

Teams fully committed to LangChain/LangGraph who don't need multi-framework support, aren't hitting production scale where cost optimization matters, and want the tightest possible integration with the LangChain eval ecosystem.

Teams considering LangSmith alternatives beyond this comparison can see our full LangSmith alternatives guide.

Langfuse

Langfuse is an open-source LLM observability platform. The key word is "LLM" — it was designed for tracing language model calls, not the broader behavior of agent systems.

Where Langfuse works well

Open source and self-hostable. For teams with strict data residency requirements, compliance constraints, or a philosophical preference for open tooling, Langfuse is the only one of the three tools that is open source and self-hostable outside an enterprise plan.

Generous free tier. 50K events per month on the hosted cloud is genuinely useful for early-stage products. You can get meaningful observability before spending anything.

Active community. Langfuse has good third-party content, a responsive community, and SDK support across Python, TypeScript, and major frameworks.

Flexible scoring. Langfuse's scoring API lets you attach custom evaluation scores to traces programmatically. If you're willing to build your own scoring functions, it's flexible.
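As a rough idea of what "building your own scoring function" involves, here is a minimal Python sketch of a custom scorer. It does not call the Langfuse SDK; the function names (`keyword_coverage`, `build_score`) and the payload shape are hypothetical and only illustrate the kind of code you would write before attaching a score to a trace.

```python
def keyword_coverage(answer: str, required_terms: list[str]) -> float:
    """Fraction of required terms that appear in the answer (case-insensitive)."""
    if not required_terms:
        return 1.0
    text = answer.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)


def build_score(trace_id: str, answer: str, required_terms: list[str]) -> dict:
    """Package the score as a plain dict. This payload shape is an assumption."""
    return {
        "trace_id": trace_id,
        "name": "keyword_coverage",
        "value": keyword_coverage(answer, required_terms),
    }


score = build_score(
    "trace-123",
    "Your refund was issued on Monday.",
    ["refund", "invoice"],
)
print(score)
```

A real scorer would likely be an LLM-as-judge call or a task-specific check; the point is that the logic, thresholds, and maintenance are yours to own.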

Where Langfuse falls short

Retrofitted for agents. Langfuse was built for LLM tracing and extended to support agent concepts afterward. The architecture shows: traces are still organized around LLM calls, not agent decisions. Multi-agent delegation, tool registries, and agent planning loops are second-class citizens.

No built-in eval templates. Every evaluation function is written from scratch. For teams that need LLM-as-judge scoring working immediately — not after a sprint of eval engineering — this is a significant investment.

No cost optimizer. Token cost tracking exists at the trace level. There's no analysis layer identifying optimization opportunities across your agent fleet.

Self-hosting complexity. Running Langfuse yourself means managing Postgres, ClickHouse, Redis, and a Next.js application. It's not trivial. For small teams, this operational overhead can exceed the value of data sovereignty.

Who should use Langfuse

Teams with non-negotiable data residency or open-source requirements, early-stage projects where the free hosted tier is sufficient, and teams with engineering capacity to build custom evals and manage self-hosted infrastructure.

For teams evaluating Langfuse alternatives, we have a dedicated Langfuse alternatives comparison.

LumiqTrace

LumiqTrace was built for agents from day one. Not retrofitted from LLM monitoring. The core architecture treats agent decisions, tool calls, and multi-agent handoffs as first-class primitives — not footnotes in an LLM trace.

What makes LumiqTrace different

Agentic traces. Every execution is traced with full agent identity: every span knows which agent owns it. Delegations are first-class spans that record which agent handed off to which, what context was passed, what came back, and the latency and cost of the sub-execution. Auto-discovery builds a live agent map from real execution data. Competitors log spans. LumiqTrace traces agents.
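To make "delegations as first-class spans" concrete, here is a minimal Python sketch of a data model in which every span carries an agent identity and a delegation span records a handoff. The names (`Span`, `delegated_to`, `build_agent_map`) are hypothetical illustrations, not LumiqTrace's actual schema or SDK.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One traced operation. All field names are illustrative assumptions."""
    span_id: str
    agent: str                          # agent that owns this span
    kind: str                           # "llm", "tool", or "delegation"
    delegated_to: Optional[str] = None  # set only on delegation spans
    latency_ms: float = 0.0
    cost_usd: float = 0.0


def build_agent_map(spans: list[Span]) -> dict[str, set[str]]:
    """Derive a 'who delegates to whom' graph from delegation spans."""
    graph: dict[str, set[str]] = defaultdict(set)
    for span in spans:
        if span.kind == "delegation" and span.delegated_to:
            graph[span.agent].add(span.delegated_to)
    return dict(graph)


spans = [
    Span("s1", agent="support", kind="llm", latency_ms=420, cost_usd=0.004),
    Span("s2", agent="support", kind="delegation", delegated_to="research",
         latency_ms=1800, cost_usd=0.021),
    Span("s3", agent="research", kind="tool", latency_ms=350),
]
agent_map = build_agent_map(spans)
print(agent_map)
```

Because the handoff is its own record rather than an annotation on an LLM call, the agent map falls out of a single pass over the spans.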

Provider auto-patch + one framework handler. LumiqTrace init silently patches all LLM providers (OpenAI, Anthropic, Gemini, Bedrock, Mistral) — no changes to your LLM calls. Framework-level agent tracing adds one handler: LumiqtraceCallbackHandler() for LangChain, LumiqtraceCrewAIListener() for CrewAI, LumiqtraceADKHandler() for Google ADK. OpenAI Agents SDK is fully covered by the provider patch with no handler needed.

# Python
import lumiqtrace
lumiqtrace.init(api_key="YOUR_API_KEY")

// TypeScript
import { lumiqtrace } from "@lumiqtrace/sdk";
lumiqtrace.init({ apiKey: process.env.LT_KEY });

That's it. Your agents are traced.

12 built-in eval templates. LLM-as-judge evaluation runs automatically on every trace. Templates cover faithfulness, relevance, toxicity, groundedness, instruction following, coherence, and more. You can customize scoring thresholds, but you don't build scoring functions from scratch.

AI cost optimizer. LumiqTrace analyzes your trace data to surface cost reduction opportunities: which models you could swap for cheaper alternatives without quality regression, which prompts are unnecessarily long, which agents re-execute work that could be cached. Real dollar amounts attached to real recommendations.

LumiqPilot. A conversational AI ops assistant built into your dashboard. Ask "why did costs spike?" — Pilot reads your live traces and surfaces the exact session, model, and deployment that caused it. From the same conversation, take action: create an alert, switch models, roll back a prompt — without leaving Pilot. On Scale, Pilot surfaces anomalies and cost opportunities proactively and can auto-remediate incidents based on rules you define.

AI anomaly detection. A statistical baseline is built from your traces automatically. When latency spikes, error rates shift, or cost patterns change, you get alerted before your users hit the problem.
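The post does not say which statistical method is used. As one plausible illustration, a z-score check against a rolling baseline can be sketched in Python as follows. The function name, the threshold of 3 standard deviations, and the sample latencies are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu  # a flat baseline makes any deviation notable
    return abs(value - mu) / sigma > threshold


# Hypothetical recent latencies (ms) for one agent.
latencies_ms = [410, 395, 430, 405, 420, 415, 400, 425]
print(is_anomalous(latencies_ms, 418))   # within normal variation
print(is_anomalous(latencies_ms, 1900))  # far outside the baseline
```

The same check applies to error rates or cost per trace; a production system would also need to handle seasonality and slow drift, which this sketch ignores.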

Where LumiqTrace has trade-offs

Not open source. Teams with strict open-source requirements should use Langfuse.
Self-hosted is enterprise only. The default path is managed cloud.
Newer product. Less community content than LangSmith. Fewer third-party integrations at time of writing.

LumiqTrace pricing

Plan	Price	Traces/month
Free	$0	10,000
Solo	$39/mo	100,000
Pro	$149/mo	500,000
Team	$299/mo	2,000,000
Scale	Custom	10M+
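For comparing plans, it can help to convert each tier to an effective price per 1,000 traces at full usage. The short Python sketch below does that arithmetic using only the figures in the table above; the helper name `cost_per_1k` is made up for the example.

```python
# Plan prices (USD/month) and included traces, copied from the pricing table.
plans = {
    "Solo": (39, 100_000),
    "Pro": (149, 500_000),
    "Team": (299, 2_000_000),
}


def cost_per_1k(price_usd: float, traces: int) -> float:
    """Effective price per 1,000 traces if the full allowance is used."""
    return price_usd / traces * 1000


for name, (price, traces) in plans.items():
    print(f"{name}: ${cost_per_1k(price, traces):.2f} per 1,000 traces")
```

At full usage that works out to roughly $0.39 (Solo), $0.30 (Pro), and $0.15 (Team) per 1,000 traces. If you use less than the allowance, the effective rate is higher.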

The Agent-Specific Gap

Here's what most comparisons skip: there's a fundamental architectural difference between LLM observability and agent observability.

LLM observability tools answer: "What prompt was sent, what response came back, how long did it take, what did it cost?" That's useful for simple chatbots. It's insufficient for agents.

An agent handling a customer support query might:

Retrieve context from a knowledge base
Call a CRM API to check account history
Fail on the first tool call and retry with a reformulated query
Hand off to a specialized research agent
Receive a sub-result, reason over it, generate a response
Internally evaluate the response quality
Send it

Those seven steps can expand into a dozen or so operations across multiple models, several external APIs, and real decision-making throughout. To debug this effectively, you need:

Agentic traces with agent identity on every span — not just generic call trees
Cost attribution at the individual span level, not just at the request level
Tool call success and failure rates across your agent fleet over time
Automatic anomaly detection when behavior changes at any layer
The ability to query traces without writing complex filter syntax
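Two of the requirements above, span-level cost attribution and tool failure rates, can be sketched in a few lines of Python once spans carry agent identity. The span records and key names below are hypothetical and do not represent any vendor's export format.

```python
from collections import defaultdict

# Hypothetical span records; the keys are illustrative assumptions.
spans = [
    {"agent": "support", "kind": "llm", "cost_usd": 0.004, "ok": True},
    {"agent": "support", "kind": "tool", "tool": "crm_lookup", "cost_usd": 0.0, "ok": False},
    {"agent": "support", "kind": "tool", "tool": "crm_lookup", "cost_usd": 0.0, "ok": True},
    {"agent": "research", "kind": "llm", "cost_usd": 0.021, "ok": True},
    {"agent": "research", "kind": "tool", "tool": "kb_search", "cost_usd": 0.001, "ok": True},
]


def cost_by_agent(spans: list[dict]) -> dict[str, float]:
    """Sum span-level cost per owning agent."""
    totals: dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s["agent"]] += s["cost_usd"]
    return dict(totals)


def tool_failure_rates(spans: list[dict]) -> dict[str, float]:
    """Share of failed calls per tool, across all agents."""
    calls: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for s in spans:
        if s["kind"] == "tool":
            calls[s["tool"]] += 1
            if not s["ok"]:
                failures[s["tool"]] += 1
    return {tool: failures[tool] / calls[tool] for tool in calls}


print(cost_by_agent(spans))
print(tool_failure_rates(spans))
```

Neither aggregation is possible if cost and status are only recorded at the request level, which is the gap this section describes.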

LangSmith and Langfuse were designed before multi-agent systems with complex tool use were the norm. They track LLM calls well. LumiqTrace was designed when agents with tools, memory, and delegation were the baseline assumption.

Decision Guide

Use LangSmith if:

Your entire stack is LangChain or LangGraph and you want zero-friction integration
You need deep eval integration within the LangChain ecosystem specifically
You're not yet at the scale where cost optimization needs to be automated

Use Langfuse if:

Open source or self-hosting is a hard requirement (compliance, data residency)
You're early stage and the generous free tier fits your volume
You have engineering capacity to build custom evals and run self-hosted infra

Use LumiqTrace if:

You're running agents in production across multiple frameworks
You need to understand multi-agent delegation, not just LLM calls
Cost optimization and anomaly detection need to be automated, not manual
You want LLM-as-judge evals working immediately without writing scoring functions
Setup time and maintenance burden matter — you have a product to build

Frequently Asked Questions

What is the difference between LangSmith and Langfuse?

LangSmith is closed-source and LangChain-native, with 5,000 free traces per month. Langfuse is MIT-licensed and self-hostable, with 50,000 free events per month on its hosted cloud. LangSmith wins for LangChain/LangGraph depth; Langfuse wins for open-source compliance and self-hosting.

Which is better for multi-agent systems?

Neither was designed for multi-agent delegation as a first-class concept. LangSmith added LangGraph state diffs; Langfuse added Agent Graphs in November 2025. Both are retrofitted from LLM monitoring. LumiqTrace was built agent-first from day one: every span carries agent identity and delegations are first-class spans with full context.

Is LangSmith free?

The Developer tier includes 5,000 traces per month with 14-day retention and one seat. The Plus plan is $39 per seat per month with 10,000 base traces. Extended retention costs $5 per 1,000 additional traces.

Is Langfuse open source?

Yes. Langfuse is MIT-licensed and has been fully open source since June 2025. All core features are available for self-hosting with no license fee, but self-hosting requires ClickHouse, which costs $200–800 per month in infrastructure.

For a broader comparison including Helicone and others, see our AI agent observability tools overview.

LumiqTrace is free to start. 10,000 traces per month, no credit card required, setup under 5 minutes.