LLM Observability

Name: @tank/llm-observability
Author: Elad Ben Haim

Core Philosophy

Trace what matters, not everything blindly — Good observability starts with meaningful spans, metadata, and outcomes, not a pile of noisy logs.
Evaluation is part of the product loop — Prompts, retrieval, latency, and cost should be measured as continuously as code regressions.
Prompt changes need versioning and evidence — Never ship prompt edits without a way to compare behavior, cost, and failure rate.
Human feedback and automated scores complement each other — Neither alone is enough for trustworthy LLM systems.
Cost, latency, and quality trade off together — A “better” prompt or model is not better if it wrecks budgets or user response time.

Trace request → retrieval → prompt → model → post-processing
Capture prompt version, model, latency, token usage, and user/session context
Log enough artifacts to reproduce failures safely
-> See `references/tracing-and-spans.md`
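
The request → retrieval → prompt → model → post-processing chain above can be sketched with the standard library alone. This is a minimal illustration, not any vendor's SDK: the `Trace` and `Span` classes, the `qa-v3` prompt version, the `example-model` name, and the whitespace-based token counts are all hypothetical stand-ins.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    trace_id: str
    metadata: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class Trace:
    """Collects the spans of one request (hypothetical helper, not a vendor SDK)."""

    def __init__(self, user_id: str, session_id: str):
        self.trace_id = uuid.uuid4().hex  # correlation ID shared by every span
        self.context = {"user_id": user_id, "session_id": session_id}
        self.spans: list[Span] = []

    @contextmanager
    def span(self, name: str, **metadata):
        span = Span(name=name, trace_id=self.trace_id, metadata=dict(metadata))
        start = time.perf_counter()
        try:
            yield span
        finally:
            # Record latency even when the wrapped step raises.
            span.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(span)


def handle_request(question: str, trace: Trace) -> str:
    with trace.span("retrieval") as s:
        docs = ["doc-1", "doc-2"]  # stand-in for a real retriever
        s.metadata["doc_ids"] = docs
    with trace.span("prompt", prompt_version="qa-v3"):
        prompt = f"Context: {docs}\nQuestion: {question}"
    with trace.span("model", model="example-model") as s:
        answer = "stub answer"  # stand-in for a real model call
        # Whitespace split is a crude placeholder for real token counts.
        s.metadata["tokens"] = {
            "input": len(prompt.split()),
            "output": len(answer.split()),
        }
    with trace.span("post-processing"):
        answer = answer.strip().capitalize()
    return answer


trace = Trace(user_id="u-42", session_id="s-1")
result = handle_request("What is tracing?", trace)
```

Each span carries the shared `trace_id`, so a failing request can be reassembled from its steps. In practice the spans would be exported to whichever platform is in use rather than kept in memory.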

| Need | Approach |
|------|----------|
| Quick regression check | Saved eval dataset + side-by-side scores |
| RAG quality check | Retrieval + answer metrics |
| Production rollout | Prompt versioning + experiment comparison |

-> See `references/evaluation-and-regressions.md`
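
As a rough sketch of the "saved eval dataset + side-by-side scores" row, the snippet below scores two prompt versions on the same dataset and applies a simple regression gate. The dataset, the `prompt_v1` and `prompt_v2` stand-ins, and the `max_drop` threshold are hypothetical; a real setup would call the model and would likely use richer metrics than exact match.

```python
from statistics import mean

# Hypothetical saved eval dataset: inputs with known-good labels.
DATASET = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]


def exact_match(output: str, expected: str) -> float:
    """1.0 when the normalized strings are equal, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_version(generate, dataset):
    """Score one prompt version on every row of the dataset."""
    return [exact_match(generate(row["input"]), row["expected"]) for row in dataset]


def compare(baseline, candidate, dataset, max_drop: float = 0.0):
    """Side-by-side scores plus the inputs where the candidate got worse."""
    base = run_version(baseline, dataset)
    cand = run_version(candidate, dataset)
    report = {
        "baseline": mean(base),
        "candidate": mean(cand),
        "regressed_inputs": [
            row["input"] for row, b, c in zip(dataset, base, cand) if c < b
        ],
    }
    # Gate: the candidate may not score more than max_drop below the baseline.
    report["passed"] = report["candidate"] >= report["baseline"] - max_drop
    return report


# Stand-ins for "call the model with prompt version N".
def prompt_v1(text: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}.get(text, "")


def prompt_v2(text: str) -> str:
    return {"2+2": "4", "capital of France": "paris ", "3*3": "9"}.get(text, "")


report = compare(prompt_v1, prompt_v2, DATASET)
```

Listing the regressed inputs alongside the averages matters: an unchanged mean can hide rows that flipped from pass to fail.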

Track tokens, latency, and model choice per route/use case
Compare prompt versions and model routing costs
Set budget alerts and route-level cost review
-> See `references/cost-latency-and-ops.md`
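
One way to picture per-route tracking with budget alerts is a small in-memory ledger. Everything named here is an assumption for illustration: the model names, the per-1,000-token prices, the routes, and the budgets are invented, and a production system would persist these totals and alert through its monitoring stack.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICES = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


class RouteLedger:
    """Accumulates cost and latency per route and flags routes over budget."""

    def __init__(self, budgets: dict):
        self.budgets = budgets  # route -> budget in USD for the review window
        self.totals = defaultdict(lambda: {"calls": 0, "cost": 0.0, "latency_ms": 0.0})

    def record(self, route, model, input_tokens, output_tokens, latency_ms):
        totals = self.totals[route]
        totals["calls"] += 1
        totals["cost"] += call_cost(model, input_tokens, output_tokens)
        totals["latency_ms"] += latency_ms

    def mean_latency_ms(self, route) -> float:
        totals = self.totals[route]
        return totals["latency_ms"] / totals["calls"] if totals["calls"] else 0.0

    def over_budget(self):
        # Routes without a configured budget are never flagged.
        return sorted(
            route
            for route, totals in self.totals.items()
            if totals["cost"] > self.budgets.get(route, float("inf"))
        )


ledger = RouteLedger(budgets={"/chat": 0.05, "/summarize": 0.01})
ledger.record("/chat", "large-model", 2000, 500, latency_ms=1800)
ledger.record("/chat", "large-model", 1500, 400, latency_ms=2200)
ledger.record("/summarize", "small-model", 4000, 300, latency_ms=600)
alerts = ledger.over_budget()
```

Keeping cost and latency in the same record per route makes the trade-off visible: moving `/chat` to a cheaper model would show up in both columns at once.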

| Signal | Recommendation |
|--------|----------------|
| Prompt/version + tracing focus | Langfuse |
| LangChain-heavy stack and experiments | LangSmith |
| Open-source tracing/eval focus | Phoenix |
| Lightweight gateway-style monitoring | Helicone |

| Signal | Use |
|--------|-----|
| Deterministic task with clear labels | Dataset + exact-match or assertion-based metrics |
| Subjective generation quality | Rubric or model-graded evals |
| RAG system | Retrieval metrics + answer quality metrics |
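
For the RAG row, retrieval quality is usually scored separately from answer quality. The sketch below shows two common retrieval metrics, recall@k and reciprocal rank, on hypothetical document IDs; answer quality would be scored on top of this with exact match, a rubric, or a model grader, depending on the task.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical example: ranked retriever output and labelled relevant docs.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

scores = {
    "recall@3": recall_at_k(retrieved, relevant, k=3),
    "recall@4": recall_at_k(retrieved, relevant, k=4),
    "reciprocal_rank": reciprocal_rank(retrieved, relevant),
}
```

Averaging `reciprocal_rank` over a dataset gives mean reciprocal rank (MRR). A low recall@k with a good answer score often means the model is answering from its own knowledge rather than from the retrieved context, which is worth checking before trusting either number.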

| File | Contents |
|------|----------|
| `references/tracing-and-spans.md` | Trace design, spans, generations, metadata, correlation IDs, prompt/model/version capture |
| `references/evaluation-and-regressions.md` | Datasets, prompt experiments, regression gates, RAGAS/DeepEval-style thinking, score interpretation |
| `references/prompt-versioning-and-feedback.md` | Prompt registries, version promotion, human feedback capture, annotation workflows |
| `references/cost-latency-and-ops.md` | Token/cost tracking, latency budgets, routing decisions, production dashboards, alerts |
| `references/platform-selection.md` | Langfuse, LangSmith, Phoenix, Helicone, Braintrust trade-offs and deployment patterns |