@tank/rag-production
Version 1.0.0 · Verified

Description
Production RAG architecture covering chunking, embeddings, vector databases, hybrid search, reranking, advanced retrieval, evaluation, and production operations.

Triggered by
RAG, retrieval augmented generation, chunking, embeddings, vector database, hybrid search

Install
tank install @tank/rag-production

RAG Production
Core Philosophy
- Retrieval quality trumps generation quality — A perfect LLM cannot compensate for irrelevant context. Invest 80% of effort in the retrieval pipeline (chunking, indexing, search, reranking) before tuning prompts.
- Start naive, measure, then optimize — Begin with recursive character splitting + single vector search. Add hybrid search, reranking, and query expansion only when metrics prove the baseline insufficient.
- Chunk for retrieval, not for storage — Chunk boundaries determine what the model sees. Optimize chunk size and overlap for answer completeness, not disk efficiency.
- Evaluate continuously with RAGAS — Track faithfulness, context precision, context recall, and answer relevance on every pipeline change. Gut-feel evaluation hides regression.
- Cache aggressively, embed once — Embedding computation and reranking are the dominant costs. Cache embeddings at ingestion, cache retrieval results per query hash, batch embed during off-peak.
Quick-Start: Common Problems
"How do I build a basic RAG pipeline?"
- Load documents with appropriate loaders (PDF, markdown, HTML)
- Split into chunks: recursive character splitter, 512-1024 tokens, 10-20% overlap
- Embed chunks with `text-embedding-3-small` (cost-effective) or `text-embedding-3-large` (quality)
- Store in vector database (pgvector for existing Postgres, Pinecone for managed)
- Retrieve top-k (k=5-10) by cosine similarity on query embedding
- Construct prompt: system instructions + retrieved context + user query
- Generate with LLM, include source citations
-> See references/chunking-strategies.md and references/embedding-models.md
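The steps above can be sketched end to end. The `embed` function here is a toy bag-of-words stand-in so the example runs offline — a real pipeline would call `text-embedding-3-small` — and the prompt format is illustrative, not prescribed:

```python
import math

VOCAB: dict[str, int] = {}  # word -> dimension, grown on first sight

def embed(text: str) -> list[float]:
    """Toy bag-of-words embedding -- stand-in for a real embedding model."""
    words = text.lower().split()
    for w in words:
        VOCAB.setdefault(w, len(VOCAB))
    vec = [0.0] * len(VOCAB)
    for w in words:
        vec[VOCAB[w]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is cosine similarity;
    # zip truncation is safe because missing trailing dimensions are zero.
    return sum(x * y for x, y in zip(a, b))

def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Numbering the chunks in the prompt is what lets the model emit `[n]` citations that map back to sources.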
"Retrieval returns irrelevant results"
- Check chunk size — too large buries signal, too small loses context
- Add hybrid search: combine BM25 full-text + vector similarity with RRF
- Add a reranker (Cohere Rerank or cross-encoder) as second-stage filter
- Try HyDE: generate a hypothetical answer, embed that instead of the raw query
- Verify embedding model matches your domain (multilingual, code, etc.)
-> See references/retrieval-patterns.md and references/reranking.md
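Reciprocal Rank Fusion (RRF) is how the BM25 and vector rankings above get merged: each document scores the sum of 1/(k + rank) across rankings. A minimal sketch — `k = 60` is the conventional default, and the document ids are illustrative:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists into one by reciprocal-rank scoring."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["d3", "d1", "d7"]    # keyword (BM25) order
vector_ranked = ["d1", "d5", "d3"]  # embedding-similarity order
fused = rrf_merge([bm25_ranked, vector_ranked])
```

Because RRF uses only ranks, it needs no score normalization between the two retrievers — the usual reason it is preferred over weighted score blending.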
"RAG is too expensive in production"
- Use `text-embedding-3-small` with reduced dimensions (512d via Matryoshka)
- Cache embeddings — never re-embed unchanged documents
- Cache retrieval results by query hash (TTL: 5-60 minutes)
- Use tiered retrieval: cheap vector search first, expensive reranker only on top-50
- Batch embedding calls during ingestion, not per-request
-> See references/production-operations.md
"How do I evaluate my RAG pipeline?"
- Generate a test set: 50-100 question-answer pairs from your corpus
- Run RAGAS metrics: faithfulness, context precision, context recall, answer relevance
- Baseline first, then measure each pipeline change independently
- Set quality gates: faithfulness > 0.85, context precision > 0.75
-> See references/evaluation.md
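The quality gates can be enforced as a simple check over the metric scores a RAGAS run reports. The gate names and thresholds mirror the bullets above; the sample scores are illustrative:

```python
# Thresholds from the quality gates above.
GATES = {"faithfulness": 0.85, "context_precision": 0.75}

def check_gates(metrics: dict[str, float],
                gates: dict[str, float] = GATES) -> dict[str, tuple[float, float]]:
    """Return {metric: (actual, threshold)} for every failed gate.
    An empty dict means the pipeline change passes."""
    return {
        name: (metrics.get(name, 0.0), threshold)
        for name, threshold in gates.items()
        if metrics.get(name, 0.0) < threshold
    }

run = {"faithfulness": 0.91, "context_precision": 0.72, "context_recall": 0.80}
failed = check_gates(run)
```

Wiring this into CI — fail the build when `check_gates` returns a non-empty dict — is what turns the gates from advice into regression protection.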
Decision Trees
Vector Database Selection
| Signal | Recommendation |
|---|---|
| Already using PostgreSQL | pgvector (simplest ops, hybrid via tsvector) |
| Managed, zero-ops required | Pinecone Serverless |
| Need native hybrid search + multi-tenancy | Weaviate |
| Maximum performance, self-hosted | Qdrant |
| Local dev / prototyping | Chroma |
| Enterprise, massive scale | Milvus or MongoDB Atlas Vector Search |
Chunking Strategy
| Content Type | Strategy |
|---|---|
| Prose documents (articles, reports) | Recursive character, 512-1024 tokens, 10% overlap |
| Markdown / structured docs | Markdown header splitter (preserve hierarchy) |
| Code repositories | Language-aware splitter (function/class boundaries) |
| Legal / regulatory | Semantic chunking (sentence-transformer boundaries) |
| FAQ / Q&A pairs | Keep each Q&A as a single chunk |
| Tables / structured data | Keep table intact, embed with surrounding context |
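The recursive character strategy in the table tries coarse separators first (paragraphs, then lines, then words) and only falls back to finer ones when a piece is still over the size limit. A minimal sketch — sizes counted in characters rather than tokens, and overlap omitted for brevity:

```python
SEPARATORS = ["\n\n", "\n", " ", ""]  # coarse -> fine

def recursive_split(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring paragraph, then line, then word boundaries."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in SEPARATORS:
        if sep and sep in text:
            chunks: list[str] = []
            buf = ""
            for piece in text.split(sep):
                candidate = buf + sep + piece if buf else piece
                if len(candidate) <= chunk_size:
                    buf = candidate  # keep packing pieces into this chunk
                else:
                    if buf:
                        chunks.append(buf)
                    if len(piece) > chunk_size:
                        # Piece is still too big: recurse with finer separators.
                        chunks.extend(recursive_split(piece, chunk_size))
                        buf = ""
                    else:
                        buf = piece
            if buf:
                chunks.append(buf)
            return [c for c in chunks if c.strip()]
    # No separator applies (one very long token): hard character split.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Production splitters (e.g. LangChain's `RecursiveCharacterTextSplitter`) add token-based length functions and overlap between adjacent chunks on top of this same descent.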
Retrieval Strategy
| Signal | Approach |
|---|---|
| Baseline / starting point | Single vector search, top-k=5 |
| Keyword-heavy queries fail | Add BM25 hybrid search + RRF |
| Top-k results contain noise | Add reranker (Cohere Rerank or cross-encoder) |
| Short/ambiguous queries | HyDE or multi-query expansion |
| Complex multi-part questions | Query decomposition → parallel retrieval → merge |
| Agent needs selective retrieval | Agentic RAG with tool-use |
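The query-decomposition row above can be sketched as: split the question into sub-queries, retrieve for each, and merge with de-duplication. Both `decompose` and `retrieve` here are hypothetical stand-ins — a real system would ask an LLM to decompose and hit a vector index, usually with the sub-query retrievals running in parallel:

```python
def decompose(query: str) -> list[str]:
    """Stand-in decomposer: splits on ' and '. A real system uses an LLM."""
    parts = [p.strip() for p in query.split(" and ")]
    return [p if p.endswith("?") else p + "?" for p in parts]

def retrieve(sub_query: str, index: dict[str, list[str]]) -> list[str]:
    """Stand-in retriever: exact lookup in a toy index keyed by sub-query."""
    return index.get(sub_query, [])

def gather_context(query: str, index: dict[str, list[str]]) -> list[str]:
    merged: list[str] = []
    seen: set[str] = set()
    for sub in decompose(query):
        for chunk in retrieve(sub, index):
            if chunk not in seen:  # de-duplicate across sub-queries
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

De-duplicating before context assembly matters because sub-queries over the same corpus frequently pull back the same chunks.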
Reference Index
| File | Contents |
|---|---|
| references/chunking-strategies.md | Fixed, recursive, semantic, document-aware, parent-child, and hierarchical chunking with size/overlap guidance |
| references/embedding-models.md | OpenAI, Cohere, open-source models, dimensionality reduction, Matryoshka, quantization, benchmarks |
| references/vector-databases.md | pgvector, Pinecone, Weaviate, Qdrant, Chroma, Milvus comparison with indexing (HNSW, IVF), ops trade-offs |
| references/retrieval-patterns.md | Hybrid search (BM25 + vector), metadata filtering, HyDE, multi-query, query decomposition, contextual compression |
| references/reranking.md | Cohere Rerank, cross-encoders, ColBERT, reciprocal rank fusion, MMR diversity, scoring pipelines |
| references/context-assembly.md | Prompt construction, citation/attribution, lost-in-the-middle, context window budgeting, source deduplication |
| references/evaluation.md | RAGAS metrics (faithfulness, precision, recall), DeepEval, test set generation, quality gates, regression testing |
| references/advanced-rag.md | Agentic RAG, self-RAG, CRAG, graph RAG (GraphRAG, RAPTOR), multimodal RAG, late chunking |
| references/production-operations.md | Caching (embedding, retrieval, LLM), streaming, cost optimization, monitoring, ingestion pipelines, scaling |