
@tank/llm-observability

1.0.0

Description

Production LLM observability and evaluation patterns covering tracing, prompt versioning, regression testing, RAG evaluation, cost and latency monitoring, dashboards, and platform trade-offs across Langfuse, LangSmith, Phoenix, Helicone, and related tooling.

tank install @tank/llm-observability

LLM Observability

Core Philosophy

  1. Trace what matters, not everything blindly — Good observability starts with meaningful spans, metadata, and outcomes, not a pile of noisy logs.
  2. Evaluation is part of the product loop — Prompts, retrieval, latency, and cost should be measured as continuously as code regressions.
  3. Prompt changes need versioning and evidence — Never ship prompt edits without a way to compare behavior, cost, and failure rate.
  4. Human feedback and automated scores complement each other — Neither alone is enough for trustworthy LLM systems.
  5. Cost, latency, and quality trade off together — A “better” prompt or model is not better if it wrecks budgets or user response time.

Quick-Start: Common Problems

"We can’t debug bad LLM outputs"

  1. Trace request → retrieval → prompt → model → post-processing
  2. Capture prompt version, model, latency, token usage, and user/session context
  3. Log enough artifacts to reproduce failures safely -> See references/tracing-and-spans.md
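The full request → retrieval → prompt → model → post-processing path can be sketched as a minimal tracer. This is an illustrative stand-in, not any platform's SDK; the `Span`/`Tracer` names and metadata keys (`prompt_version`, `usage`, and so on) are assumptions chosen to match the fields listed above.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    """One traced step: retrieval, prompt render, model call, or post-processing."""
    trace_id: str
    name: str
    metadata: dict = field(default_factory=dict)
    started_at: float = 0.0
    latency_ms: float = 0.0

class Tracer:
    def __init__(self):
        self.spans: list[Span] = []

    @contextmanager
    def span(self, trace_id: str, name: str, **metadata):
        s = Span(trace_id=trace_id, name=name, metadata=metadata, started_at=time.time())
        try:
            yield s
        finally:
            # Record wall-clock latency even if the step raised.
            s.latency_ms = (time.time() - s.started_at) * 1000
            self.spans.append(s)

tracer = Tracer()
trace_id = str(uuid.uuid4())  # correlation ID shared by every span in the request

# Capture prompt version, model, token usage, and user/session context
# so a bad output can be reproduced later. Values here are stand-ins.
with tracer.span(trace_id, "generation",
                 prompt_version="answer-v3",
                 model="gpt-4o-mini",
                 user_id="u-123", session_id="s-456") as span:
    span.metadata["usage"] = {"input_tokens": 412, "output_tokens": 96}
```

A real deployment would ship these spans to an observability backend instead of an in-memory list, but the shape of what gets captured is the point.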

"How do we evaluate prompt changes safely?"

| Need | Approach |
| --- | --- |
| quick regression check | saved eval dataset + side-by-side scores |
| RAG quality check | retrieval + answer metrics |
| production rollout | prompt versioning + experiment comparison |

-> See references/evaluation-and-regressions.md
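A quick regression check of the kind described above can be sketched as follows. The saved dataset, the canned `run_prompt` responses, and the exact-match scorer are all illustrative stand-ins; a real run would call the model per prompt version.

```python
# Saved eval dataset: inputs with expected outputs (illustrative).
eval_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_prompt(version: str, item: dict) -> str:
    # Placeholder for a real model call keyed by prompt version.
    canned = {
        ("v1", "2+2"): "4",
        ("v1", "capital of France"): "Paris, the capital",
        ("v2", "2+2"): "4",
        ("v2", "capital of France"): "Paris",
    }
    return canned[(version, item["input"])]

def exact_match_score(version: str) -> float:
    hits = sum(run_prompt(version, it) == it["expected"] for it in eval_set)
    return hits / len(eval_set)

# Side-by-side scores for baseline and candidate prompt versions.
scores = {v: exact_match_score(v) for v in ("v1", "v2")}

# Regression gate: the candidate must not score below the baseline.
passed = scores["v2"] >= scores["v1"]
```

Exact match only fits deterministic tasks; subjective outputs need rubric or model-graded scoring, as the decision tree below notes.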

"Costs are climbing fast"

  1. Track tokens, latency, and model choice per route/use case
  2. Compare prompt versions and model routing costs
  3. Set budget alerts and route-level cost review -> See references/cost-latency-and-ops.md
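The per-route cost tracking and budget alerting above can be sketched like this. The price table, route names, and budget threshold are hypothetical; real per-token prices come from the provider's pricing page.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices keyed by model.
PRICES = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

route_costs: dict[str, float] = defaultdict(float)

def record_call(route: str, model: str, input_tokens: int, output_tokens: int) -> float:
    """Accumulate the dollar cost of one model call under its route."""
    p = PRICES[model]
    cost = input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]
    route_costs[route] += cost
    return cost

record_call("search-answers", "gpt-4o-mini", 1200, 300)
record_call("search-answers", "gpt-4o-mini", 800, 200)

BUDGET_USD = 0.0005  # hypothetical per-route budget for the review window
alerts = [route for route, cost in route_costs.items() if cost > BUDGET_USD]
```

Grouping by route/use case rather than by raw request is what makes the review actionable: a single expensive route stands out instead of disappearing into a global total.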

Decision Trees

Observability Platform Choice

| Signal | Recommendation |
| --- | --- |
| prompt/version + tracing focus | Langfuse |
| LangChain-heavy stack and experiments | LangSmith |
| open-source tracing/eval focus | Phoenix |
| lightweight gateway-style monitoring | Helicone |

Evaluation Strategy

| Signal | Use |
| --- | --- |
| deterministic task with clear labels | dataset + exact/assertive metrics |
| subjective generation quality | rubric or model-graded evals |
| RAG system | retrieval metrics + answer quality metrics |
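For the RAG row, the retrieval side can be measured with a standard metric such as recall@k: of the documents labeled relevant for a query, how many appear in the top k retrieved? The document IDs and relevance labels below are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of labeled-relevant documents found in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Ranked retriever output and the labeled relevant set for one query.
retrieved = ["doc-7", "doc-2", "doc-9", "doc-4"]
relevant = {"doc-2", "doc-4", "doc-5"}

score = recall_at_k(retrieved, relevant, k=3)  # only doc-2 is in the top 3
```

Retrieval metrics like this say nothing about the final answer, which is why the table pairs them with answer quality metrics.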

Reference Index

| File | Contents |
| --- | --- |
| references/tracing-and-spans.md | Trace design, spans, generations, metadata, correlation IDs, prompt/model/version capture |
| references/evaluation-and-regressions.md | Datasets, prompt experiments, regression gates, RAGAS/DeepEval-style thinking, score interpretation |
| references/prompt-versioning-and-feedback.md | Prompt registries, version promotion, human feedback capture, annotation workflows |
| references/cost-latency-and-ops.md | Token/cost tracking, latency budgets, routing decisions, production dashboards, alerts |
| references/platform-selection.md | Langfuse, LangSmith, Phoenix, Helicone, Braintrust trade-offs and deployment patterns |
