
@tank/rag-production

1.0.0

Description

Production RAG architecture covering chunking, embeddings, vector databases, hybrid search, reranking, advanced retrieval, evaluation, and production operations.

Triggered by

RAG, retrieval augmented generation, chunking, embeddings, vector database, hybrid search
tank install @tank/rag-production

RAG Production

Core Philosophy

  1. Retrieval quality trumps generation quality — A perfect LLM cannot compensate for irrelevant context. Invest 80% of effort in the retrieval pipeline (chunking, indexing, search, reranking) before tuning prompts.
  2. Start naive, measure, then optimize — Begin with recursive character splitting + single vector search. Add hybrid search, reranking, and query expansion only when metrics prove the baseline insufficient.
  3. Chunk for retrieval, not for storage — Chunk boundaries determine what the model sees. Optimize chunk size and overlap for answer completeness, not disk efficiency.
  4. Evaluate continuously with RAGAS — Track faithfulness, context precision, context recall, and answer relevance on every pipeline change. Gut-feel evaluation hides regressions.
  5. Cache aggressively, embed once — Embedding computation and reranking are the dominant costs. Cache embeddings at ingestion, cache retrieval results per query hash, batch embed during off-peak.

Quick-Start: Common Problems

"How do I build a basic RAG pipeline?"

  1. Load documents with appropriate loaders (PDF, markdown, HTML)
  2. Split into chunks: recursive character splitter, 512-1024 tokens, 10-20% overlap
  3. Embed chunks with text-embedding-3-small (cost-effective) or text-embedding-3-large (quality)
  4. Store in vector database (pgvector for existing Postgres, Pinecone for managed)
  5. Retrieve top-k (k=5-10) by cosine similarity on query embedding
  6. Construct prompt: system instructions + retrieved context + user query
  7. Generate with LLM, include source citations

-> See references/chunking-strategies.md and references/embedding-models.md
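Step 2 above can be sketched as a simple sliding-window splitter. This is a minimal stand-in for a library splitter (e.g. a recursive character splitter); sizes here are in characters rather than tokens, and `split_text` is an illustrative name, not a library API:

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks with overlap, snapping each cut to the
    nearest paragraph, line, sentence, or word boundary inside the window."""
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to break on a natural boundary rather than mid-word.
            for sep in ("\n\n", "\n", ". ", " "):
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # step forward, keeping the overlap
    return [c for c in chunks if c]
```

Token-accurate sizing would measure chunks with the embedding model's tokenizer (e.g. tiktoken) instead of character counts.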

"Retrieval returns irrelevant results"

  1. Check chunk size — too large buries signal, too small loses context
  2. Add hybrid search: combine BM25 full-text + vector similarity with RRF
  3. Add a reranker (Cohere Rerank or cross-encoder) as second-stage filter
  4. Try HyDE: generate a hypothetical answer, embed that instead of the raw query
  5. Verify embedding model matches your domain (multilingual, code, etc.)

-> See references/retrieval-patterns.md and references/reranking.md
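The RRF step above (step 2) can be sketched in a few lines. Reciprocal rank fusion merges any number of ranked lists by position alone, so BM25 and vector scores never need to be normalized against each other; `rrf_fuse` and the document IDs are illustrative, and k=60 is a commonly used default:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    where rank is the 1-based position of d in each list."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7", "d2"]    # keyword (BM25) ranking
vector = ["d1", "d4", "d3", "d9"]  # dense vector ranking
fused = rrf_fuse([bm25, vector])   # d1 and d3 rise: both lists agree on them
```

Documents that appear high in both lists dominate the fused ranking, which is exactly the behavior hybrid search relies on.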

"RAG is too expensive in production"

  1. Use text-embedding-3-small with reduced dimensions (512d via Matryoshka)
  2. Cache embeddings — never re-embed unchanged documents
  3. Cache retrieval results by query hash (TTL: 5-60 minutes)
  4. Use tiered retrieval: cheap vector search first, expensive reranker only on top-50
  5. Batch embedding calls during ingestion, not per-request

-> See references/production-operations.md

"How do I evaluate my RAG pipeline?"

  1. Generate a test set: 50-100 question-answer pairs from your corpus
  2. Run RAGAS metrics: faithfulness, context precision, context recall, answer relevance
  3. Baseline first, then measure each pipeline change independently
  4. Set quality gates: faithfulness > 0.85, context precision > 0.75

-> See references/evaluation.md

Decision Trees

Vector Database Selection

| Signal | Recommendation |
| --- | --- |
| Already using PostgreSQL | pgvector (simplest ops, hybrid via tsvector) |
| Managed, zero-ops required | Pinecone Serverless |
| Need native hybrid search + multi-tenancy | Weaviate |
| Maximum performance, self-hosted | Qdrant |
| Local dev / prototyping | Chroma |
| Enterprise, massive scale | Milvus or MongoDB Atlas Vector Search |

Chunking Strategy

| Content Type | Strategy |
| --- | --- |
| Prose documents (articles, reports) | Recursive character, 512-1024 tokens, 10% overlap |
| Markdown / structured docs | Markdown header splitter (preserve hierarchy) |
| Code repositories | Language-aware splitter (function/class boundaries) |
| Legal / regulatory | Semantic chunking (sentence-transformer boundaries) |
| FAQ / Q&A pairs | Keep each Q&A as a single chunk |
| Tables / structured data | Keep table intact, embed with surrounding context |

Retrieval Strategy

| Signal | Approach |
| --- | --- |
| Baseline / starting point | Single vector search, top-k=5 |
| Keyword-heavy queries fail | Add BM25 hybrid search + RRF |
| Top-k results contain noise | Add reranker (Cohere Rerank or cross-encoder) |
| Short/ambiguous queries | HyDE or multi-query expansion |
| Complex multi-part questions | Query decomposition → parallel retrieval → merge |
| Agent needs selective retrieval | Agentic RAG with tool-use |

Reference Index

| File | Contents |
| --- | --- |
| references/chunking-strategies.md | Fixed, recursive, semantic, document-aware, parent-child, and hierarchical chunking with size/overlap guidance |
| references/embedding-models.md | OpenAI, Cohere, open-source models, dimensionality reduction, Matryoshka, quantization, benchmarks |
| references/vector-databases.md | pgvector, Pinecone, Weaviate, Qdrant, Chroma, Milvus comparison with indexing (HNSW, IVF), ops trade-offs |
| references/retrieval-patterns.md | Hybrid search (BM25 + vector), metadata filtering, HyDE, multi-query, query decomposition, contextual compression |
| references/reranking.md | Cohere Rerank, cross-encoders, ColBERT, reciprocal rank fusion, MMR diversity, scoring pipelines |
| references/context-assembly.md | Prompt construction, citation/attribution, lost-in-the-middle, context window budgeting, source deduplication |
| references/evaluation.md | RAGAS metrics (faithfulness, precision, recall), DeepEval, test set generation, quality gates, regression testing |
| references/advanced-rag.md | Agentic RAG, self-RAG, CRAG, graph RAG (GraphRAG, RAPTOR), multimodal RAG, late chunking |
| references/production-operations.md | Caching (embedding, retrieval, LLM), streaming, cost optimization, monitoring, ingestion pipelines, scaling |
