@tank/rag-production
Skill · v1.0.0
Description
Production RAG architecture covering chunking, embeddings, vector databases, hybrid search, reranking, advanced retrieval, evaluation, and production operations.
Triggered by
RAG, retrieval augmented generation, chunking, embeddings, vector database, hybrid search
Install
tank install @tank/rag-production

RAG Production
Core Philosophy
- Retrieval quality trumps generation quality — A perfect LLM cannot compensate for irrelevant context. Invest 80% of effort in the retrieval pipeline (chunking, indexing, search, reranking) before tuning prompts.
- Start naive, measure, then optimize — Begin with recursive character splitting + single vector search. Add hybrid search, reranking, and query expansion only when metrics prove the baseline insufficient.
- Chunk for retrieval, not for storage — Chunk boundaries determine what the model sees. Optimize chunk size and overlap for answer completeness, not disk efficiency.
- Evaluate continuously with RAGAS — Track faithfulness, context precision, context recall, and answer relevance on every pipeline change. Gut-feel evaluation hides regression.
- Cache aggressively, embed once — Embedding computation and reranking are the dominant costs. Cache embeddings at ingestion, cache retrieval results per query hash, batch embed during off-peak.
Quick-Start: Common Problems
"How do I build a basic RAG pipeline?"
- Load documents with appropriate loaders (PDF, markdown, HTML)
- Split into chunks: recursive character splitter, 512-1024 tokens, 10-20% overlap
- Embed chunks with text-embedding-3-small (cost-effective) or text-embedding-3-large (quality)
- Store in vector database (pgvector for existing Postgres, Pinecone for managed)
- Retrieve top-k (k=5-10) by cosine similarity on query embedding
- Construct prompt: system instructions + retrieved context + user query
- Generate with LLM, include source citations
-> See references/chunking-strategies.md and references/embedding-models.md
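The splitting step above can be sketched as a minimal recursive character splitter. Sizes here are in characters rather than tokens (a simplification; divide by roughly 4 to approximate token counts), and the separator list is an assumption you should tune for your corpus:

```python
def recursive_split(text, chunk_size=1000, overlap=100,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text at the coarsest separator that keeps chunks under chunk_size."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    if not separators:
        # no separators left: hard-split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = [p + sep for p in text.split(sep)]
    pieces[-1] = pieces[-1][:-len(sep)]   # last piece gained a spurious separator
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current.strip())
            current = current[-overlap:]  # carry overlap across the boundary
        if len(piece) > chunk_size:
            # a single piece is still too big: recurse with finer separators
            chunks.extend(recursive_split(current + piece, chunk_size, overlap, finer))
            current = ""
        else:
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Production splitters (LangChain's RecursiveCharacterTextSplitter, for example) work the same way but count tokens and handle edge cases more carefully.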
"Retrieval returns irrelevant results"
- Check chunk size — too large buries signal, too small loses context
- Add hybrid search: combine BM25 full-text + vector similarity with RRF
- Add a reranker (Cohere Rerank or cross-encoder) as second-stage filter
- Try HyDE: generate a hypothetical answer, embed that instead of the raw query
- Verify embedding model matches your domain (multilingual, code, etc.)
-> See references/retrieval-patterns.md and references/reranking.md
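The hybrid-search suggestion above hinges on reciprocal rank fusion, which is only a few lines. A sketch, assuming each retriever returns an ordered list of document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.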
"RAG is too expensive in production"
- Use text-embedding-3-small with reduced dimensions (512d via Matryoshka)
- Cache embeddings — never re-embed unchanged documents
- Cache retrieval results by query hash (TTL: 5-60 minutes)
- Use tiered retrieval: cheap vector search first, expensive reranker only on top-50
- Batch embedding calls during ingestion, not per-request
-> See references/production-operations.md
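The query-hash cache above can be sketched as follows; the key normalization (strip + lowercase) is an assumption, so adapt the keying to your traffic:

```python
import hashlib
import time

class RetrievalCache:
    """Cache retrieval results keyed by a hash of the normalized query."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, results)

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query):
        entry = self._store.get(self._key(query))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # miss or expired

    def put(self, query, results):
        self._store[self._key(query)] = (time.monotonic() + self.ttl, results)
```

In production you would back this with Redis rather than a process-local dict so the cache survives restarts and is shared across replicas.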
"How do I evaluate my RAG pipeline?"
- Generate a test set: 50-100 question-answer pairs from your corpus
- Run RAGAS metrics: faithfulness, context precision, context recall, answer relevance
- Baseline first, then measure each pipeline change independently
- Set quality gates: faithfulness > 0.85, context precision > 0.75
-> See references/evaluation.md
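The quality gates can be enforced as a simple CI check; the metric names follow RAGAS conventions, and the thresholds mirror the ones above:

```python
GATES = {"faithfulness": 0.85, "context_precision": 0.75}

def check_gates(metrics, gates=GATES):
    """Return a list of failure messages; an empty list means the run passes."""
    failures = []
    for name, threshold in gates.items():
        value = metrics.get(name, 0.0)  # a missing metric counts as a failure
        if value < threshold:
            failures.append(f"{name}: {value:.2f} < {threshold}")
    return failures
```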
Decision Trees
Vector Database Selection
| Signal | Recommendation |
|---|---|
| Already using PostgreSQL | pgvector (simplest ops, hybrid via tsvector) |
| Managed, zero-ops required | Pinecone Serverless |
| Need native hybrid search + multi-tenancy | Weaviate |
| Maximum performance, self-hosted | Qdrant |
| Local dev / prototyping | Chroma |
| Enterprise, massive scale | Milvus or MongoDB Atlas Vector Search |
Chunking Strategy
| Content Type | Strategy |
|---|---|
| Prose documents (articles, reports) | Recursive character, 512-1024 tokens, 10% overlap |
| Markdown / structured docs | Markdown header splitter (preserve hierarchy) |
| Code repositories | Language-aware splitter (function/class boundaries) |
| Legal / regulatory | Semantic chunking (sentence-transformer boundaries) |
| FAQ / Q&A pairs | Keep each Q&A as a single chunk |
| Tables / structured data | Keep table intact, embed with surrounding context |
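For markdown and other structured docs, the header-aware split in the table above can be sketched by tracking the heading path so every chunk carries its hierarchy:

```python
import re

def split_markdown(text):
    """Split markdown on headings, pairing each chunk with its heading path."""
    heading = re.compile(r"^(#{1,6})\s+(.*)$")
    path, chunks, body = [], [], []

    def flush():
        if body and "".join(body).strip():
            chunks.append((" > ".join(path), "".join(body).strip()))
        body.clear()

    for line in text.splitlines(keepends=True):
        m = heading.match(line)
        if m:
            flush()
            level = len(m.group(1))
            del path[level - 1:]          # drop headings at this depth or deeper
            path.append(m.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Prepending the heading path (e.g. "Guide > Setup") to each chunk before embedding is a cheap way to keep hierarchy visible to the retriever.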
Retrieval Strategy
| Signal | Approach |
|---|---|
| Baseline / starting point | Single vector search, top-k=5 |
| Keyword-heavy queries fail | Add BM25 hybrid search + RRF |
| Top-k results contain noise | Add reranker (Cohere Rerank or cross-encoder) |
| Short/ambiguous queries | HyDE or multi-query expansion |
| Complex multi-part questions | Query decomposition → parallel retrieval → merge |
| Agent needs selective retrieval | Agentic RAG with tool-use |
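The baseline row is just brute-force cosine similarity over the index. A dependency-free sketch (a real deployment would rely on the database's ANN index instead):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=5):
    """index: list of (doc_id, vector) pairs. Returns the k closest doc IDs."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Exact scan is O(n) per query, which is fine for prototypes up to tens of thousands of chunks; beyond that, HNSW or IVF indexes trade a little recall for large speedups.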
Reference Index
| File | Contents |
|---|---|
references/chunking-strategies.md | Fixed, recursive, semantic, document-aware, parent-child, and hierarchical chunking with size/overlap guidance |
references/embedding-models.md | OpenAI, Cohere, open-source models, dimensionality reduction, Matryoshka, quantization, benchmarks |
references/vector-databases.md | pgvector, Pinecone, Weaviate, Qdrant, Chroma, Milvus comparison with indexing (HNSW, IVF), ops trade-offs |
references/retrieval-patterns.md | Hybrid search (BM25 + vector), metadata filtering, HyDE, multi-query, query decomposition, contextual compression |
references/reranking.md | Cohere Rerank, cross-encoders, ColBERT, reciprocal rank fusion, MMR diversity, scoring pipelines |
references/context-assembly.md | Prompt construction, citation/attribution, lost-in-the-middle, context window budgeting, source deduplication |
references/evaluation.md | RAGAS metrics (faithfulness, precision, recall), DeepEval, test set generation, quality gates, regression testing |
references/advanced-rag.md | Agentic RAG, self-RAG, CRAG, graph RAG (GraphRAG, RAPTOR), multimodal RAG, late chunking |
references/production-operations.md | Caching (embedding, retrieval, LLM), streaming, cost optimization, monitoring, ingestion pipelines, scaling |