Applied AI

RAG in practice: architecting an agent over your company's knowledge base

RAG is neither magic nor fine-tuning. It is a retrieval architecture that anchors the LLM's answers in your real content, with source citations and fewer hallucinations.

Johnny Carreiro · March 31, 2026 · 3 min read

RAG (Retrieval-Augmented Generation) has become a slide-deck acronym. But behind the buzzword is a simple, powerful architecture that solves the most common applied-AI problem in companies: making an LLM answer from your knowledge rather than from what it learned on the internet, with a verifiable source and no retraining.

The problem RAG solves

A bare LLM knows a lot about the world and nothing about your company. Ask about your internal return policy, your catalog, or a customer's history, and it will tend to invent a plausible answer (hallucinate), because it was trained to always produce an answer, not to say "I don't know."

Two flawed solutions usually come up first. The first is fine-tuning, which is expensive and slow and bakes the knowledge into the model: update the content and you need to retrain. The second is pasting everything into the prompt, which blows the context limit and buries the relevant information in noise. RAG is the third path, and almost always the right one.

How RAG works

The core idea: instead of the model knowing everything, it retrieves what it needs at question time and answers only based on what it found.

The flow has two phases. In indexing (offline), you split your documents into chunks, generate an embedding for each (a vector that represents the meaning of the text), and store those vectors in a vector database (pgvector, for example). In querying (online), the user's question is also turned into an embedding, you fetch the most similar chunks by vector proximity, and you inject those passages into the LLM prompt along with the question. The model then answers anchored in those passages and cites the ones it used.
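The two phases can be sketched in a few lines. This is a minimal illustration, not a production design: embed() here is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database such as pgvector, and the file names and sample texts are invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing (offline): one embedding per chunk, stored next to its source.
# The sources and texts below are hypothetical sample data.
chunks = [
    {"source": "returns.md", "text": "Customers may return items within 30 days of delivery."},
    {"source": "shipping.md", "text": "Standard shipping takes 5 business days."},
    {"source": "warranty.md", "text": "Hardware carries a 12 month warranty."},
]
index = [(embed(c["text"]), c) for c in chunks]

def retrieve(question: str, k: int = 2) -> list[dict]:
    """Querying (online): embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(question: str, passages: list[dict]) -> str:
    """Inject the retrieved passages, tagged with their source, into the prompt."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Answer using only the passages below and cite the source in brackets.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

question = "How many days do I have to return an item?"
passages = retrieve(question)
prompt = build_prompt(question, passages)  # this string is what gets sent to the LLM
```

With a real embedding model the ranking reflects meaning rather than shared words, but the shape of the pipeline stays the same: embed, rank by similarity, build the prompt.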

The practical difference is huge: the model stops guessing and starts citing. Updated a document? Re-index just that one, and the next answer already uses the new content. No retraining, no fine-tuning cost.
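Re-indexing a single document might look like the sketch below. It assumes an in-memory dict as the index and a content hash to detect changes; upsert_document and the sample file names are hypothetical, and a real pipeline would also compute and store an embedding for each chunk at the marked step.

```python
import hashlib

# document id -> {"digest": content hash, "chunks": list of chunk texts}
index: dict[str, dict] = {}

def upsert_document(doc_id: str, text: str) -> bool:
    """Re-index one document. Returns False if its content has not changed."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    entry = index.get(doc_id)
    if entry is not None and entry["digest"] == digest:
        return False  # unchanged: nothing to re-embed
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    # A real pipeline would embed each chunk here before storing it.
    index[doc_id] = {"digest": digest, "chunks": chunks}
    return True

upsert_document("shipping.md", "Standard shipping takes 5 business days.")
# Only this document's chunks are replaced; every other entry is untouched.
upsert_document("shipping.md", "Standard shipping takes 3 business days.")
```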

What actually makes RAG work

The basic flow fits in a tutorial. What separates a demo from a production system is the details.

Smart chunking. Splitting by a fixed character count cuts sentences in half and destroys meaning. Splitting by section, paragraph, or the document's logical structure keeps each chunk self-contained, which preserves context and improves retrieval.
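The difference is easy to see side by side. The sketch below assumes paragraphs are separated by blank lines and packs whole paragraphs into chunks up to a size budget; the function names, the max_chars parameter, and the sample text are illustrative choices, not a standard API.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: cut every `size` characters, wherever that lands."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def paragraph_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

doc = (
    "Returns are accepted within 30 days.\n\n"
    "Refunds go to the original payment method.\n\n"
    "Shipping takes 5 business days."
)

naive = fixed_chunks(doc, 40)          # the first chunk ends mid-word
by_para = paragraph_chunks(doc, 60)    # every chunk is a complete paragraph
```

The same packing idea extends to sections: split on the document's headings first, then pack paragraphs within each section.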

Retrieval quality. If the system fetches the wrong chunks, the model confidently gives a wrong answer. It is worth measuring: for a set of known questions, are the right passages being retrieved? Techniques like re-ranking and hybrid search (vector + keyword) often raise precision.
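Both ideas fit in a short sketch. The first function measures how often the expected source appears in the top k results for a labeled question set. The second merges a vector ranking and a keyword ranking with reciprocal rank fusion, one common way to implement hybrid search (the text above does not prescribe a specific fusion method). The retriever, the labels, and the rankings are made-up sample data.

```python
def hit_rate_at_k(retrieve, labeled, k=3):
    """Share of questions whose expected source appears in the top k results."""
    hits = sum(1 for question, expected in labeled if expected in retrieve(question)[:k])
    return hits / len(labeled)

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical retriever output for two known questions.
fake_results = {
    "How do I return an item?": ["returns.md", "faq.md"],
    "How long does shipping take?": ["faq.md", "warranty.md"],
}
labeled = [
    ("How do I return an item?", "returns.md"),
    ("How long does shipping take?", "shipping.md"),
]
rate = hit_rate_at_k(lambda q: fake_results[q], labeled, k=2)

# Hypothetical rankings from a vector search and a keyword search.
vector_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Documents that rank well in both lists rise to the top of the fused ranking, which is why hybrid search tends to help when either signal alone misses exact terms or paraphrases.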

Source citation. In production, the answer has to say where it came from. That lets the user verify it and builds trust, and it is often a requirement in regulated sectors.
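One way to make citations checkable is to number the passages in the prompt and then verify that every marker in the answer points at a passage that was actually retrieved. This is a sketch under that assumption; the [n] marker convention, the function names, and the sample data are illustrative.

```python
import re

def number_passages(passages: list[dict]) -> str:
    """Render passages as '[1] (source) text' lines for the prompt."""
    return "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )

def resolve_citations(answer: str, passages: list[dict]) -> list[str]:
    """Map [n] markers in the answer back to sources; reject unknown markers."""
    cited = []
    for n in (int(m) for m in re.findall(r"\[(\d+)\]", answer)):
        if not 1 <= n <= len(passages):
            raise ValueError(f"answer cites unknown passage [{n}]")
        source = passages[n - 1]["source"]
        if source not in cited:
            cited.append(source)
    return cited

# Hypothetical retrieved passages and model answer.
passages = [
    {"source": "returns.md", "text": "Customers may return items within 30 days."},
    {"source": "shipping.md", "text": "Standard shipping takes 5 business days."},
]
answer = "You can return items within 30 days [1]."
sources = resolve_citations(answer, passages)
```

An answer with no markers, or with a marker that does not resolve, can be flagged or regenerated before it reaches the user.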

Evals. Without a set of test questions with expected answers, you do not know whether a change improved or worsened the system. Evals turn "seems better" into "accuracy went from 78% to 91%."
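A minimal eval harness can be very small. The sketch below assumes each test case lists terms the answer must contain, which is a crude but cheap correctness check; answer_fn stands for whatever function calls your RAG pipeline, and the stub and cases here are invented for illustration.

```python
def run_evals(answer_fn, cases):
    """Run every test question and return (accuracy, list of failed questions)."""
    failures = []
    for case in cases:
        answer = answer_fn(case["question"]).lower()
        if not all(term.lower() in answer for term in case["must_contain"]):
            failures.append(case["question"])
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Hypothetical test set and a stub in place of the real pipeline.
cases = [
    {"question": "What is the return window?", "must_contain": ["30 days"]},
    {"question": "How long does shipping take?", "must_contain": ["5 business days"]},
]

def stub_answer(question: str) -> str:
    canned = {"What is the return window?": "Returns are accepted within 30 days [1]."}
    return canned.get(question, "I don't know.")

accuracy, failures = run_evals(stub_answer, cases)
```

Running the same cases before and after each change to chunking, retrieval, or the prompt is what makes a comparison between two versions meaningful.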

When RAG is not the answer

RAG shines when there is a textual knowledge base that changes frequently and answers need to cite a source. It is not the right tool for everything: if the task is a deterministic calculation, a structured database query, or a decision with fixed rules, a SQL query or an agent with tools solves it better than semantic retrieval. And if the knowledge is stable and fits in the context, sometimes pasting it into the prompt is enough.

The practical path

Start small: one document base, pgvector, section-based chunking, and a set of 20-30 evaluation questions. Measure accuracy, tune chunking and retrieval, add re-ranking if needed. Only then think about scale — embedding caches, incremental index updates, query observability.

RAG is not magic. It is retrieval engineering plus an LLM, done with judgment. Well architected, it turns the company's scattered knowledge into an assistant that answers with a source — and that you control entirely.