Retrieval-augmented generation has become the default pattern for putting LLMs on top of private data. The demos are straightforward: chunk a PDF, throw the chunks into a vector database, retrieve the top matches, stuff them into a prompt, call GPT-4. It works on your README. Then you ship it to real users and it falls apart.
This post covers what actually breaks and the architecture decisions that matter when you move from demo to production. It’s the shape of RAG we build for clients at Mainix, refined across projects in legal, support, internal knowledge, and sales-enablement contexts.
Why naive RAG fails
The three-step pipeline (chunk, embed, retrieve) assumes that the user’s question and the document chunks share enough vocabulary that cosine similarity surfaces the right passages. This works for Q&A over short, homogeneous documents. It breaks down quickly:
- Questions are short, documents are long. A question like “what’s our refund policy?” has maybe four relevant words. A 50-page handbook has tens of thousands. Semantic similarity is noisy at that asymmetry.
- Users ask messy, multi-part questions. “Does our Enterprise tier include SSO, and is it available in the EU under GDPR?” needs two different retrievals combined.
- Context limits. You can’t just shove every possibly-relevant chunk into the prompt. You have to rank and filter.
- Hallucination with citations. LLMs will happily fabricate a plausible answer and cite a real document ID that doesn’t support it. Without validation, users trust the wrong output.
The production architecture
Every production RAG system we build has the same seven-layer shape. Skip any of them and something breaks.
1. Ingestion with structure-aware chunking
Fixed-size chunking (say, 500 tokens) destroys the structure of the source. A heading gets separated from its body, a table cell loses its row context, a function signature loses its docstring. We chunk along the actual document structure (headers, sections, list items, table rows) and add overlap only where it helps recall. For PDFs, parse to a structured tree first (with tools like Unstructured.io or Docling) rather than flattening to plain text.
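Here’s a minimal sketch of the idea for Markdown sources, assuming Python 3.10+; the `Chunk` shape and the `max_chars` limit are illustrative, not a library API:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    section_path: list[str] = field(default_factory=list)

def chunk_markdown(md: str, max_chars: int = 2000) -> list[Chunk]:
    """Split on headings so every chunk stays inside one section."""
    chunks: list[Chunk] = []
    path: list[str] = []
    buf: list[str] = []

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            chunks.append(Chunk(text=text, section_path=list(path)))
        buf.clear()

    for line in md.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section before the path changes
            level = len(line) - len(line.lstrip("#"))
            # Truncate the path to the new heading's depth, then descend.
            path[:] = path[: level - 1] + [line.lstrip("# ").strip()]
        else:
            buf.append(line)
            if sum(len(l) for l in buf) > max_chars:
                flush()  # oversize section: split it, keeping the same path
    flush()
    return chunks
```

For PDFs, the same walk applies to the parsed tree instead of heading lines.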
2. Metadata on every chunk
Each chunk gets attached metadata: source document, section path (H1 > H2 > H3), page number, document date, author, permissions, and domain tags. This metadata becomes the foundation for filtering, ranking, and citations. Skipping this step is the single most common reason RAG systems underperform.
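As a concrete shape, something like this (the field names are ours; adapt them to your domain):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ChunkMetadata:
    source_doc: str            # e.g. "employee-handbook-2024.pdf"
    section_path: list[str]    # e.g. ["Benefits", "Leave", "Parental leave"]
    page: int | None           # page number, for PDF citations
    doc_date: date | None      # freshness signal for ranking
    author: str | None
    allowed_groups: list[str]  # permissions, enforced as a retrieval filter
    domain_tags: list[str]     # e.g. ["hr", "policy"]
```

Permissions in particular belong in the retrieval filter, not in a post-generation check; by then the model has already seen text the user shouldn’t.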
3. Hybrid retrieval: vector + keyword
Pure vector search misses exact-match queries (“error code 4042” or “GDPR Article 17”). Pure keyword search misses semantic ones (“how do I export data” → “account download”). We always run both a BM25 keyword index and a vector index, then fuse the results with reciprocal rank fusion (RRF). Postgres with pgvector plus a proper text search index handles both in one place for most projects; at larger scale we move to Pinecone or Weaviate with Elasticsearch on the side.
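The fusion itself is a few lines. Here’s the standard RRF formulation (k = 60 is the conventional constant from the original paper):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs from BM25 and vector search.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so agreement between retrievers beats a high rank in just one.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused_ids = reciprocal_rank_fusion([bm25_ids, vector_ids])
```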
4. Query understanding and expansion
Before retrieval, we run the user’s question through a small LLM to rewrite it. This does three things: expands abbreviations (“MRR” → “monthly recurring revenue”), splits multi-part questions into sub-queries, and generates hypothetical answers (HyDE) for cases where the answer’s phrasing differs from the question’s. The cost is one extra LLM call per query; the quality improvement is dramatic.
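A sketch of that rewrite step using OpenAI’s JSON mode; the model choice and prompt wording here are ours, shown only to illustrate the shape:

```python
import json
from openai import OpenAI

client = OpenAI()

def expand_query(question: str) -> dict:
    prompt = (
        "Rewrite the user's question for document retrieval. Return JSON with "
        'keys "sub_queries" (a list of self-contained sub-questions, with '
        'abbreviations expanded) and "hypothetical_answer" (a short passage '
        "phrased the way the documents would answer it, for HyDE).\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # a small, cheap model is enough for rewriting
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```

Each sub-query and the hypothetical answer then goes through retrieval separately, and the candidate sets are fused as in step 3.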
5. Reranking
Retrieval gives you 20 to 50 candidate chunks. A reranker, usually a cross-encoder like Cohere Rerank or a BGE reranker, scores each (query, chunk) pair together and picks the top 5 to 10 for the prompt. This step alone often doubles answer quality. Skipping the reranker to save latency is a false economy.
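A local version with the open BGE reranker via sentence-transformers (Cohere’s hosted API is the drop-in alternative); the model name and `top_k` are illustrative:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: list[str], top_k: int = 8) -> list[str]:
    # A cross-encoder reads query and chunk together, so it judges
    # relevance far better than the bi-encoder that did retrieval.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```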
6. Generation with structured output and citations
The prompt doesn’t just contain the chunks. It has strict instructions: answer only from the provided context, cite every factual claim with a chunk ID, respond with a specific JSON structure if the output will be rendered programmatically. We use the model’s structured output features (OpenAI’s JSON Schema mode, Anthropic’s tool use) to force the shape. Free-form prose invites hallucination.
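Here’s roughly what that looks like with OpenAI’s JSON Schema mode. It assumes the `client` from earlier, plus `context` (the reranked chunks, each prefixed with its ID) and `question` already in scope:

```python
ANSWER_SCHEMA = {
    "name": "grounded_answer",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
            "citations": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "chunk_id": {"type": "string"},
                        "claim": {"type": "string"},  # the claim this chunk supports
                    },
                    "required": ["chunk_id", "claim"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["answer", "citations"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_schema", "json_schema": ANSWER_SCHEMA},
    messages=[
        {"role": "system", "content": (
            "Answer ONLY from the provided context. Cite every factual claim "
            "with a chunk_id. If the context does not contain the answer, say so."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
answer = json.loads(resp.choices[0].message.content)
```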
7. Evaluation and guardrails
Production RAG needs a continuous evaluation loop. We maintain a dataset of realistic questions with ground-truth answers, and run the pipeline against it on every change. Metrics: retrieval recall, answer accuracy, citation validity, latency, cost per query. For guardrails, we validate that every citation in the output actually supports the claim (a second LLM call or simple text match), and flag or block the answer if not.
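The cheap first-pass check on citations is lexical; anything it can’t confirm goes to the second LLM call. A minimal sketch, with an overlap threshold that is purely illustrative:

```python
def validate_citations(answer: dict, chunks_by_id: dict[str, str]) -> list[str]:
    """Return citation failures; a non-empty list means flag or block the answer."""
    failures: list[str] = []
    for cite in answer["citations"]:
        chunk = chunks_by_id.get(cite["chunk_id"])
        if chunk is None:
            failures.append(f"{cite['chunk_id']}: cited chunk does not exist")
            continue
        # Crude support test: enough substantive words from the claim should
        # appear in the cited chunk. The 0.3 threshold is a placeholder; tune it.
        terms = {w for w in cite["claim"].lower().split() if len(w) > 3}
        hits = sum(1 for w in terms if w in chunk.lower())
        if terms and hits / len(terms) < 0.3:
            failures.append(f"{cite['chunk_id']}: claim not supported by text match")
    return failures
```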
What to build vs buy
The ecosystem has matured. You shouldn’t write everything from scratch, but you also shouldn’t abandon control to an all-in-one platform that locks you in.
- Build yourself: chunking strategy, metadata schema, retrieval fusion, prompt templates, evaluation harness. These are where your product differentiates.
- Use libraries: LlamaIndex or LangChain for orchestration glue, but use them lightly; they’re easy to get locked into.
- Buy: vector databases (Pinecone, Weaviate, pgvector), rerankers (Cohere, Voyage), observability (LangSmith, Helicone), LLM providers (OpenAI, Anthropic, Google). Don’t host your own model for production RAG unless you have a specific reason.
Cost and latency reality
Real users ask a lot of questions. A RAG system that feels great in the demo at $0.10 per query will cost $30,000 a month at 10k queries a day, which is far from nothing for a mid-size B2B product. The main levers:
- Cache aggressively. Embedding cache, query-expansion cache, and LLM response cache for identical questions (see the sketch after this list).
- Use smaller models where you can. Query rewriting and reranking don’t need GPT-4; Claude Haiku or GPT-4o-mini is fine and cuts cost by 10x.
- Stream the answer. Perceived latency matters more than total latency. Start streaming tokens the moment generation begins.
- Batch where possible. Embedding calls, reranking calls, and evaluation runs are all batch-friendly and much cheaper in bulk.
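The response cache can start as simple as an exact-match key on model plus prompt. A minimal in-memory sketch (production would back this with Redis or similar):

```python
import hashlib

class ResponseCache:
    """Exact-match LLM response cache keyed on (model, prompt)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str) -> str | None:
        return self._store.get(self._key(model, prompt))

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = response
```

The same keying works for the embedding and query-expansion caches; only the value type changes.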
What we’d tell a team starting today
Start with a single-source, single-question RAG: one document type, one question type, one user group. Get evaluation set up on day one. Ship the thinnest possible version to real users, collect failure cases, iterate on retrieval and prompts. Expand scope only after the narrow case works reliably. Most failed RAG projects tried to do too much at once, before they had the infrastructure to know when things broke.
And keep a paper trail. Every chunk, every retrieval, every generation, every user rating, log it all. RAG systems get better with data about how they’re used. That data is the moat.