@tank/rag-production
Skill · v1.0.0
Description
Production RAG architecture covering chunking, embeddings, vector databases, hybrid search, reranking, advanced retrieval, evaluation, and production operations.
Triggered by
RAG, retrieval augmented generation, chunking, embeddings, vector database, hybrid search
Install
tank install @tank/rag-production

RAG Production
Core Philosophy
- Retrieval quality trumps generation quality — A perfect LLM cannot compensate for irrelevant context. Invest 80% of effort in the retrieval pipeline (chunking, indexing, search, reranking) before tuning prompts.
- Start naive, measure, then optimize — Begin with recursive character splitting + single vector search. Add hybrid search, reranking, and query expansion only when metrics prove the baseline insufficient.
- Chunk for retrieval, not for storage — Chunk boundaries determine what the model sees. Optimize chunk size and overlap for answer completeness, not disk efficiency.
- Evaluate continuously with RAGAS — Track faithfulness, context precision, context recall, and answer relevance on every pipeline change. Gut-feel evaluation hides regression.
- Cache aggressively, embed once — Embedding computation and reranking are the dominant costs. Cache embeddings at ingestion, cache retrieval results per query hash, batch embed during off-peak.
Quick-Start: Common Problems
"How do I build a basic RAG pipeline?"
- Load documents with appropriate loaders (PDF, markdown, HTML)
- Split into chunks: recursive character splitter, 512-1024 tokens, 10-20% overlap
- Embed chunks with text-embedding-3-small (cost-effective) or text-embedding-3-large (quality)
- Store in vector database (pgvector for existing Postgres, Pinecone for managed)
- Retrieve top-k (k=5-10) by cosine similarity on query embedding
- Construct prompt: system instructions + retrieved context + user query
- Generate with LLM, include source citations
-> See references/chunking-strategies.md and references/embedding-models.md
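The splitting step above can be sketched as a minimal recursive character splitter. Sizes here are in characters rather than tokens (a simplification; divide by roughly 4 to approximate token counts), and the separator list is an assumption you should tune for your corpus:

```python
def recursive_split(text, chunk_size=1000, overlap=100,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text at the coarsest separator that keeps chunks under chunk_size."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    if not separators:
        # no separators left: hard-split by size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = [p + sep for p in text.split(sep)]
    pieces[-1] = pieces[-1][:-len(sep)]   # last piece gained a spurious separator
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current.strip())
            current = current[-overlap:]  # carry overlap across the boundary
        if len(piece) > chunk_size:
            # a single piece is still too big: recurse with finer separators
            chunks.extend(recursive_split(current + piece, chunk_size, overlap, finer))
            current = ""
        else:
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Production splitters (LangChain's RecursiveCharacterTextSplitter, for example) work the same way but count tokens and handle edge cases more carefully.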
"Retrieval returns irrelevant results"
- Check chunk size — too large buries signal, too small loses context
- Add hybrid search: combine BM25 full-text + vector similarity with RRF
- Add a reranker (Cohere Rerank or cross-encoder) as second-stage filter
- Try HyDE: generate a hypothetical answer, embed that instead of the raw query
- Verify embedding model matches your domain (multilingual, code, etc.)
-> See references/retrieval-patterns.md and references/reranking.md
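The hybrid-search suggestion above hinges on reciprocal rank fusion, which is only a few lines. A sketch, assuming each retriever returns an ordered list of document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.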
"RAG is too expensive in production"
- Use text-embedding-3-small with reduced dimensions (512d via Matryoshka)
- Cache embeddings — never re-embed unchanged documents
- Cache retrieval results by query hash (TTL: 5-60 minutes)
- Use tiered retrieval: cheap vector search first, expensive reranker only on top-50
- Batch embedding calls during ingestion, not per-request
-> See references/production-operations.md
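The query-hash cache above can be sketched as follows; the key normalization (strip + lowercase) is an assumption, so adapt the keying to your traffic:

```python
import hashlib
import time

class RetrievalCache:
    """Cache retrieval results keyed by a hash of the normalized query."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, results)

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query):
        entry = self._store.get(self._key(query))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # miss or expired

    def put(self, query, results):
        self._store[self._key(query)] = (time.monotonic() + self.ttl, results)
```

In production you would back this with Redis rather than a process-local dict so the cache survives restarts and is shared across replicas.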
"How do I evaluate my RAG pipeline?"
- Generate a test set: 50-100 question-answer pairs from your corpus
- Run RAGAS metrics: faithfulness, context precision, context recall, answer relevance
- Baseline first, then measure each pipeline change independently
- Set quality gates: faithfulness > 0.85, context precision > 0.75
-> See references/evaluation.md
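The quality gates can be enforced as a simple CI check; the metric names follow RAGAS conventions, and the thresholds mirror the ones above:

```python
GATES = {"faithfulness": 0.85, "context_precision": 0.75}

def check_gates(metrics, gates=GATES):
    """Return a list of failure messages; an empty list means the run passes."""
    failures = []
    for name, threshold in gates.items():
        value = metrics.get(name, 0.0)  # a missing metric counts as a failure
        if value < threshold:
            failures.append(f"{name}: {value:.2f} < {threshold}")
    return failures
```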
Decision Trees
Vector Database Selection
| Signal | Recommendation |
|---|---|
| Already using PostgreSQL | pgvector (simplest ops, hybrid via tsvector) |
| Managed, zero-ops required | Pinecone Serverless |
| Need native hybrid search + multi-tenancy | Weaviate |
| Maximum performance, self-hosted | Qdrant |
| Local dev / prototyping | Chroma |
| Enterprise, massive scale | Milvus or MongoDB Atlas Vector Search |
Chunking Strategy
| Content Type | Strategy |
|---|---|
| Prose documents (articles, reports) | Recursive character, 512-1024 tokens, 10% overlap |
| Markdown / structured docs | Markdown header splitter (preserve hierarchy) |
| Code repositories | Language-aware splitter (function/class boundaries) |
| Legal / regulatory | Semantic chunking (sentence-transformer boundaries) |
| FAQ / Q&A pairs | Keep each Q&A as a single chunk |
| Tables / structured data | Keep table intact, embed with surrounding context |
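For markdown and other structured docs, the header-aware split in the table above can be sketched by tracking the heading path so every chunk carries its hierarchy:

```python
import re

def split_markdown(text):
    """Split markdown on headings, pairing each chunk with its heading path."""
    heading = re.compile(r"^(#{1,6})\s+(.*)$")
    path, chunks, body = [], [], []

    def flush():
        if body and "".join(body).strip():
            chunks.append((" > ".join(path), "".join(body).strip()))
        body.clear()

    for line in text.splitlines(keepends=True):
        m = heading.match(line)
        if m:
            flush()
            level = len(m.group(1))
            del path[level - 1:]          # drop headings at this depth or deeper
            path.append(m.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Prepending the heading path (e.g. "Guide > Setup") to each chunk before embedding is a cheap way to keep hierarchy visible to the retriever.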
Retrieval Strategy
| Signal | Approach |
|---|---|
| Baseline / starting point | Single vector search, top-k=5 |
| Keyword-heavy queries fail | Add BM25 hybrid search + RRF |
| Top-k results contain noise | Add reranker (Cohere Rerank or cross-encoder) |
| Short/ambiguous queries | HyDE or multi-query expansion |
| Complex multi-part questions | Query decomposition → parallel retrieval → merge |
| Agent needs selective retrieval | Agentic RAG with tool-use |
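The baseline row is just brute-force cosine similarity over the index. A dependency-free sketch (a real deployment would rely on the database's ANN index instead):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=5):
    """index: list of (doc_id, vector) pairs. Returns the k closest doc IDs."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Exact scan is O(n) per query, which is fine for prototypes up to tens of thousands of chunks; beyond that, HNSW or IVF indexes trade a little recall for large speedups.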
Reference Index
| File | Contents |
|---|---|
references/chunking-strategies.md | Fixed, recursive, semantic, document-aware, parent-child, and hierarchical chunking with size/overlap guidance |
references/embedding-models.md | OpenAI, Cohere, open-source models, dimensionality reduction, Matryoshka, quantization, benchmarks |
references/vector-databases.md | pgvector, Pinecone, Weaviate, Qdrant, Chroma, Milvus comparison with indexing (HNSW, IVF), ops trade-offs |
references/retrieval-patterns.md | Hybrid search (BM25 + vector), metadata filtering, HyDE, multi-query, query decomposition, contextual compression |
references/reranking.md | Cohere Rerank, cross-encoders, ColBERT, reciprocal rank fusion, MMR diversity, scoring pipelines |
references/context-assembly.md | Prompt construction, citation/attribution, lost-in-the-middle, context window budgeting, source deduplication |
references/evaluation.md | RAGAS metrics (faithfulness, precision, recall), DeepEval, test set generation, quality gates, regression testing |
references/advanced-rag.md | Agentic RAG, self-RAG, CRAG, graph RAG (GraphRAG, RAPTOR), multimodal RAG, late chunking |
references/production-operations.md | Caching (embedding, retrieval, LLM), streaming, cost optimization, monitoring, ingestion pipelines, scaling |