Module 5: RAG Architecture — Retrieval-Augmented Generation Deep Dive
Duration: 90-120 minutes | Level: Deep-Dive | Audience: Cloud Architects, Platform Engineers, AI Engineers | Last Updated: March 2026
5.1 Why RAG Exists
Large Language Models are powerful, but they have three fundamental limitations that make them unreliable for enterprise use without additional architecture.
Limitation 1: Knowledge Cutoff
Every LLM is frozen in time. GPT-4o's training data has a cutoff. Claude's training data has a cutoff. If your organization published a new security policy last Tuesday, no LLM on Earth knows about it. The model will either refuse to answer or, worse, confidently fabricate something plausible.
Limitation 2: Hallucination
When an LLM does not know the answer, it does not say "I don't know." It generates a plausible-sounding response that may be entirely fabricated. This is not a bug — it is a fundamental property of how autoregressive language models work. They predict the next most likely token, and "likely" does not mean "true."
Limitation 3: No Access to Your Data
Even if an LLM had perfect knowledge of the entire public internet, it still would not know your internal HR policies, your proprietary product catalog, your customer contracts, your Confluence wiki, or your internal runbooks. Enterprise data is private by definition.
The RAG Solution
Retrieval-Augmented Generation (RAG) solves all three problems with one architectural pattern: instead of relying on what the model already knows, you retrieve relevant information from your own data sources and inject it into the prompt at query time. The model then generates a response grounded in your actual data.
Without RAG: User Question → LLM (uses training data only) → Response (may hallucinate)
With RAG: User Question → Retrieve from YOUR data → LLM (uses retrieved context) → Grounded Response
RAG does not change the model. It does not retrain it. It simply gives the model a reference library to consult before answering.
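The whole pattern fits in a few lines. The sketch below is illustrative only: `search_index.search` and `llm.complete` are hypothetical stand-ins for a real search client (e.g., Azure AI Search) and an LLM SDK.

```python
# Minimal RAG sketch. `search_index` and `llm` are hypothetical stand-ins
# for a real retrieval client and generation client.
def answer_with_rag(question: str, search_index, llm, top_k: int = 5) -> str:
    # 1. Retrieve the chunks most relevant to the question
    chunks = search_index.search(question, top=top_k)
    # 2. Inject the retrieved context into the prompt
    context = "\n\n".join(chunk["content"] for chunk in chunks)
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generate a response grounded in the retrieved data
    return llm.complete(prompt)
```

Note that the model itself is untouched: only the prompt changes from query to query.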
RAG vs Fine-Tuning vs Prompt Engineering — Decision Matrix
Before building a RAG pipeline, understand where it fits alongside other techniques.
| Dimension | Prompt Engineering | RAG | Fine-Tuning | Pre-Training |
|---|---|---|---|---|
| What it does | Better instructions to the model | Gives the model your data at query time | Adjusts model weights on your data | Trains a model from scratch |
| Cost | Free | $$ (search infra + embeddings) | $$$ (GPU compute + data prep) | $$$$$$ (massive compute) |
| Time to implement | Minutes | Hours to days | Days to weeks | Months |
| Data freshness | N/A | Real-time (index updates) | Stale (requires retraining) | Stale |
| Data volume needed | 0 examples | A document corpus | 100s-1000s of examples | Billions of tokens |
| Handles private data | No | Yes | Partially (baked into weights) | Partially |
| Provides citations | No | Yes (source documents) | No | No |
| Risk of hallucination | High | Low (when well-built) | Medium | Medium |
| Best analogy | Writing better exam questions | Giving the student a textbook during the exam | Sending the student to a specialized course | Building a student from scratch |
- Prompt Engineering first — always. It is free and high-leverage.
- RAG when you need factual grounding in private, dynamic, or recent data.
- Fine-Tuning when you need the model to adopt a specific tone, format, or domain vocabulary.
- Combine them — production systems typically use all three together.
5.2 The RAG Pipeline End-to-End
Every RAG system has two distinct pipelines that operate independently.
The Two Pipelines
Ingestion Pipeline (Offline)
The ingestion pipeline runs periodically or on-demand to process your source documents and make them searchable. It does not involve any LLM calls.
| Step | What Happens | Azure Service | Latency |
|---|---|---|---|
| 1. Document Loading | Read files from storage | Azure Blob Storage, SharePoint, SQL | Seconds |
| 2. Content Extraction | Extract text, tables, images from documents | Azure AI Document Intelligence | Seconds/doc |
| 3. Cleaning | Remove headers, footers, boilerplate, artifacts | Custom code or AI Document Intelligence | Milliseconds |
| 4. Chunking | Split text into retrieval-sized segments | Custom code or Azure AI Search integrated vectorization | Milliseconds |
| 5. Enrichment | Add metadata: title, date, category, source | Azure AI Search skillsets | Seconds |
| 6. Embedding | Convert each chunk to a vector | Azure OpenAI text-embedding-3-large | ~100ms/chunk |
| 7. Indexing | Store vectors + metadata in search index | Azure AI Search, Cosmos DB | Milliseconds |
Query Pipeline (Online)
The query pipeline runs in real time for every user question. Latency matters here.
| Step | What Happens | Typical Latency |
|---|---|---|
| 1. Query Processing | Rewrite, expand, decompose the user's question | 200-500ms (if using LLM) |
| 2. Query Embedding | Convert question to a vector | 50-100ms |
| 3. Retrieval | Search the index (vector + keyword + filters) | 50-200ms |
| 4. Reranking | Reorder results by semantic relevance | 100-300ms |
| 5. Context Assembly | Build the prompt with retrieved chunks | <10ms |
| 6. LLM Generation | Generate the answer using the model | 500-3000ms |
| Total | End-to-end latency | 1-4 seconds |
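The six steps above can be tied together in a simple orchestrator. This is a sketch with per-step timing for latency budgeting; each entry in `steps` is a hypothetical placeholder for the corresponding real service call:

```python
import time

# Sketch of the online query pipeline with per-step latency measurement.
# Each function in `steps` stands in for a real service call (rewriter,
# embedding model, search index, reranker, prompt builder, LLM).
def run_query_pipeline(question: str, steps: dict) -> tuple[str, dict]:
    timings = {}

    def timed(name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[name] = (time.perf_counter() - start) * 1000  # milliseconds
        return result

    rewritten = timed("query_processing", steps["rewrite"], question)
    vector = timed("embedding", steps["embed"], rewritten)
    hits = timed("retrieval", steps["retrieve"], rewritten, vector)
    ranked = timed("rerank", steps["rerank"], hits)
    prompt = timed("assembly", steps["assemble"], rewritten, ranked)
    answer = timed("generation", steps["generate"], prompt)
    return answer, timings
```

Recording per-step timings like this makes it obvious which stage is eating the latency budget when the end-to-end number drifts above a few seconds.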
5.3 Document Ingestion and Processing
The quality of your RAG system is capped by the quality of your ingestion pipeline. If the extracted text is garbled, no amount of sophisticated retrieval will save it.
Supported Document Types
| Format | Extraction Method | Complexity | Notes |
|---|---|---|---|
| Markdown | Direct parsing | Low | Best format for RAG — structure is explicit |
| Plain Text | Direct reading | Low | No structure to leverage |
| HTML | HTML parser + cleanup | Medium | Remove nav, scripts, ads |
| PDF (digital) | Text extraction | Medium | Preserves some structure |
| PDF (scanned) | OCR required | High | Quality depends on scan quality |
| Word (.docx) | XML parsing | Medium | Preserves headings, tables |
| Excel (.xlsx) | Cell-level extraction | Medium | Convert rows to text or keep tabular |
| PowerPoint (.pptx) | Slide-by-slide extraction | Medium | Speaker notes often contain key info |
| CSV / TSV | Row-level parsing | Low | Each row can be a chunk |
| Database tables | SQL queries | Medium | Denormalize joins into text |
| Images | Vision models / OCR | High | Diagrams, whiteboard photos |
Azure AI Document Intelligence
Azure AI Document Intelligence (formerly Form Recognizer) is the recommended service for structured extraction from complex documents. It handles:
- Layout analysis — detects headings, paragraphs, tables, figures
- Table extraction — returns structured table data preserving rows and columns
- Key-value pair extraction — for forms and invoices
- OCR — for scanned documents and handwriting
- Custom models — train on your specific document layouts
```python
import os

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="https://my-doc-intel.cognitiveservices.azure.com/",
    credential=AzureKeyCredential(os.getenv("DOC_INTEL_KEY"))
)

# Analyze a PDF with the layout model
with open("annual_report.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        model_id="prebuilt-layout",
        body=f,
        content_type="application/pdf"
    )
result = poller.result()

# Extract text by page with structure
for page in result.pages:
    print(f"--- Page {page.page_number} ---")
    for line in page.lines:
        print(line.content)

# Extract tables separately (critical for RAG quality)
for table in result.tables:
    print(f"Table with {table.row_count} rows, {table.column_count} columns")
    for cell in table.cells:
        print(f"  Row {cell.row_index}, Col {cell.column_index}: {cell.content}")
```
Metadata Extraction and Enrichment
Raw text is not enough. Metadata enables filtered search, which dramatically improves retrieval relevance. Always extract and store:
| Metadata Field | Source | Purpose |
|---|---|---|
| title | Document title or filename | Display in citations |
| source_url | Original location (SharePoint, Blob) | Link back to source |
| created_date | File metadata | Freshness filtering |
| modified_date | File metadata | Freshness filtering |
| author | File metadata | Attribution |
| department | Folder structure or tag | Scope filtering |
| document_type | File extension or classification | Type filtering |
| language | Language detection | Multi-language support |
| chunk_index | Assigned during chunking | Ordering adjacent chunks |
| parent_document_id | Assigned during ingestion | Linking chunks to source |
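One way to assemble these fields is to build one index record per chunk. The field names below follow the table above; the exact schema and the `build_chunk_record` helper are illustrative assumptions, not a fixed Azure AI Search schema:

```python
import hashlib
from datetime import datetime, timezone

# Illustrative sketch: one searchable record per chunk. The field names
# mirror the metadata table; adapt them to your actual index schema.
def build_chunk_record(chunk_text: str, chunk_index: int, doc: dict) -> dict:
    # Derive a stable parent ID from the source URL so re-ingestion
    # produces the same IDs (assumption: source_url is unique per document)
    parent_id = hashlib.sha1(doc["source_url"].encode()).hexdigest()
    return {
        "id": f"{parent_id}-{chunk_index}",
        "content": chunk_text,
        "title": doc.get("title", ""),
        "source_url": doc["source_url"],
        "modified_date": doc.get(
            "modified_date", datetime.now(timezone.utc).isoformat()
        ),
        "department": doc.get("department", "unknown"),
        "language": doc.get("language", "en"),
        "chunk_index": chunk_index,
        "parent_document_id": parent_id,
    }
```

Stable, deterministic IDs matter: they let incremental re-ingestion overwrite changed chunks instead of duplicating them.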
5.4 Chunking Strategies — Critical for Relevance
Chunking is the single most impactful design decision in your RAG pipeline. Bad chunking means bad retrieval, which means bad answers. There is no model smart enough to overcome poorly chunked data.
The fundamental tension: chunks must be small enough to be relevant to specific queries, but large enough to contain sufficient context to be useful.
Strategy 1: Fixed-Size Chunking
Split text into chunks of a fixed number of characters or tokens, with optional overlap.
```python
import tiktoken

# Any tokenizer works; tiktoken's cl100k_base is used here as an example
tokenizer = tiktoken.get_encoding("cl100k_base")

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50):
    """Split text into fixed-size chunks with overlap."""
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(tokenizer.decode(chunk_tokens))
        start += chunk_size - overlap  # overlap tokens carried forward
    return chunks
```
| Aspect | Detail |
|---|---|
| How it works | Split every N tokens regardless of content |
| Overlap | Typically 10-20% (e.g., 50-100 tokens for 512-token chunks) |
| Pros | Simple, predictable chunk sizes, easy to estimate storage |
| Cons | Breaks sentences, paragraphs, and semantic units mid-thought |
| Best for | Homogeneous text without strong structure (e.g., chat logs) |
Strategy 2: Sentence-Based Chunking
Split text at sentence boundaries, grouping sentences until the chunk reaches a target size.
```python
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer model, first run only

def sentence_chunk(text: str, max_tokens: int = 512):
    """Group sentences into chunks up to max_tokens."""
    sentences = nltk.sent_tokenize(text)
    chunks, current_chunk = [], []
    current_size = 0
    for sentence in sentences:
        # Reuses the tokenizer from the fixed-size example above
        sentence_tokens = len(tokenizer.encode(sentence))
        if current_size + sentence_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk, current_size = [], 0
        current_chunk.append(sentence)
        current_size += sentence_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
| Aspect | Detail |
|---|---|
| How it works | Accumulate sentences until target size is reached |
| Pros | Never breaks a sentence in half, preserves basic meaning |
| Cons | Variable chunk sizes, may still split related paragraphs |
| Best for | Narrative text, articles, documentation |
Strategy 3: Paragraph-Based Chunking
Split at paragraph boundaries (double newlines). Each paragraph becomes a chunk, or paragraphs are grouped to reach a target size.
| Aspect | Detail |
|---|---|
| How it works | Use paragraph breaks as natural split points |
| Pros | Preserves author's intended logical groupings |
| Cons | Paragraphs vary wildly in size (some are 1 sentence, some are 500 words) |
| Best for | Well-structured documents with consistent paragraph sizes |
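A minimal paragraph-based splitter can be sketched as follows, using a character budget as a simple stand-in for token counting:

```python
def paragraph_chunk(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs (split on blank lines) into chunks up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Note the con from the table above in action: a single 500-word paragraph still becomes one oversized chunk, because this strategy never splits inside a paragraph.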
Strategy 4: Semantic Chunking
Use embedding similarity to detect where the topic shifts, then split at topic boundaries.
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_chunk(sentences: list, embeddings: np.ndarray, threshold: float = 0.75):
    """Split at points where semantic similarity drops below threshold."""
    chunks, current_chunk = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(
            embeddings[i-1].reshape(1, -1),
            embeddings[i].reshape(1, -1)
        )[0][0]
        if similarity < threshold:
            # Topic shift detected — start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
| Aspect | Detail |
|---|---|
| How it works | Embed each sentence, detect topic boundaries via similarity drops |
| Pros | Produces the most semantically coherent chunks |
| Cons | Expensive (every sentence must be embedded), variable sizes, harder to debug |
| Best for | High-value corpora where retrieval quality justifies the cost |
Strategy 5: Recursive / Hierarchical Chunking
Split using a hierarchy of separators: first by headers, then by paragraphs, then by sentences, then by characters. Only descend to the next level if the chunk exceeds the target size.
```python
# LangChain's RecursiveCharacterTextSplitter uses this approach
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=[
        "\n## ",   # H2 headings first
        "\n### ",  # H3 headings
        "\n\n",    # Paragraphs
        "\n",      # Lines
        ". ",      # Sentences
        " ",       # Words (last resort)
    ]
)
chunks = splitter.split_text(document_text)
```
| Aspect | Detail |
|---|---|
| How it works | Try the most structural separator first, fall back to smaller units |
| Pros | Respects document hierarchy, good balance of structure and size control |
| Cons | Requires well-formatted source documents |
| Best for | Markdown, HTML, structured documentation (this is the default recommendation) |
Strategy 6: Document-Aware Chunking
Specialized chunking that understands the document format.
| Sub-Strategy | How It Works | Best For |
|---|---|---|
| Markdown-aware | Split at #, ##, ### headers keeping each section intact | Technical docs, wikis |
| HTML-aware | Split at <section>, <article>, <h1>-<h6> tags | Web content |
| Table-aware | Keep entire tables as single chunks (never split a table row) | Financial reports, data sheets |
| Code-aware | Keep entire functions/classes as single chunks | Code repositories |
| Slide-aware | Keep each slide as a chunk with speaker notes | Presentations |
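The Markdown-aware sub-strategy can be sketched with a regular expression over heading lines. This illustrative version keeps each heading with its body so every chunk stays self-describing:

```python
import re

# Markdown-aware splitting sketch: each H1-H3 heading starts a new section,
# and the heading line is kept with its body text.
def markdown_section_chunks(md_text: str) -> list[str]:
    sections, current = [], []
    for line in md_text.splitlines():
        # A new heading closes out the section accumulated so far
        if re.match(r"^#{1,3} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

A production version would additionally enforce a maximum section size, falling back to paragraph or sentence splitting for oversized sections, which is exactly the recursive strategy described above.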
Chunking Strategy Comparison
| Strategy | Chunk Size Control | Semantic Quality | Implementation Cost | Compute Cost | Best For |
|---|---|---|---|---|---|
| Fixed-Size | Exact | Low | Very Low | Very Low | Uniform text, quick prototypes |
| Sentence-Based | Approximate | Medium | Low | Low | Articles, narratives |
| Paragraph-Based | Variable | Medium-High | Low | Low | Well-structured docs |
| Semantic | Variable | Highest | High | High (embedding calls) | High-value corpora |
| Recursive | Approximate | High | Medium | Low | Markdown, HTML, general docs |
| Document-Aware | Variable | High | Medium-High | Low | Format-specific content |
A common pitfall: using fixed-size chunking with no overlap on structured documents. A 512-token boundary that falls in the middle of a table or code block produces two useless chunks. Always use document-aware or recursive chunking for structured content.
5.5 Chunk Size — The Goldilocks Problem
Chunk size is the second most impactful parameter after chunking strategy. The optimal size depends on your content type, embedding model, and retrieval patterns.
The Tradeoff
| Too Small (< 128 tokens) | Just Right (256-1024 tokens) | Too Large (> 2048 tokens) |
|---|---|---|
| Loses context | Contains enough context | Dilutes relevance |
| Many chunks needed | Manageable number of chunks | Wastes LLM tokens |
| Retrieval returns fragments | Each chunk is self-contained | Hard to rank accurately |
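A quick way to flag chunks that fall outside this sweet spot without pulling in a tokenizer dependency is the common rough estimate of about 4 characters per English token. This is an approximation for sanity checks only, not a substitute for the real tokenizer:

```python
# Rough chunk-size sanity check. English prose averages roughly 4 characters
# per token, so chars // 4 is a workable estimate for flagging outliers.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def classify_chunk(text: str) -> str:
    """Flag chunks outside the 128-2048 token sweet spot."""
    tokens = estimate_tokens(text)
    if tokens < 128:
        return "too small"
    if tokens > 2048:
        return "too large"
    return "ok"
```

Running this over a freshly chunked corpus gives a fast histogram of where your splitter is producing fragments or walls of text.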