A complete systems treatment — chunking strategies, embedding geometry, contextual retrieval, late chunking, hybrid scoring, re-ranking, context assembly, and failure modes. Every component derived from first principles.
Sections · 9
Exhibits · 9
Scope · Full pipeline, including Contextual RAG, Late Chunking, BM25+dense hybrid, RRF, re-ranking, context window assembly
Date · Feb 2026
Abstract. Retrieval-Augmented Generation is not a single algorithm — it is a pipeline of seven distinct engineering decisions, each with its own failure modes and mathematical tradeoffs. This document derives each component from first principles: the information-theoretic basis for chunking, the geometry of embedding spaces and why context destroys cosine similarity, the Anthropic Contextual Retrieval mechanism and its effect on embedding distributions, JinaAI’s Late Chunking and the difference between pre-chunk and post-chunk pooling, the BM25/dense hybrid with Reciprocal Rank Fusion, cross-encoder re-ranking cost models, and the positional degradation problem in context window assembly. The goal is to give engineers a precise model of where each component can fail — and what to do about it.
§ 1
The Full RAG Pipeline
Seven components, three failure classes, one information flow
A RAG system has two phases that operate at different times: an offline indexing phase that processes documents into a searchable store, and an online query phase that retrieves and generates. Every performance problem in RAG traces to one of three failure classes: (A) recall failure — the correct chunk is not retrieved; (B) precision failure — irrelevant chunks dilute the context; (C) generation failure — the correct chunks are retrieved but the LLM fails to use them.
Exhibit 1 — Complete RAG Pipeline: Offline and Online Phases all components and data flows
Two components marked ★ dominate end-to-end performance: chunking strategy and embedding quality. Errors made at these stages cannot be corrected downstream. Re-ranking improves precision but cannot recover recall — if the correct chunk is not in the top-k retrieved by the retriever, re-ranking never sees it. The offline pipeline runs once; the online pipeline runs per query. Latency budget is entirely in the online phase.
§ 2
Chunking Strategies
The mathematics of split decisions — fixed, sentence, semantic, recursive, proposition, late
Chunking is the single decision with the highest leverage on retrieval quality, yet it is the most frequently treated as a hyperparameter to be tuned by trial and error. Every chunking strategy embeds an implicit model of what the retrieval unit should be. Making that model explicit reveals when each strategy fails.
The Chunking Objective — information theory framing

Given document D, partition it into chunks C = {c₁, …, cₙ} to maximise:

    Σᵢ I(c_relevant ; q) − Σᵢ I(c_irrelevant ; q)   subject to |cᵢ| ≤ L_max

where I(c ; q) = mutual information between chunk c and query q, and L_max = context budget allocated per chunk (balance: granularity vs. coverage). This is intractable in closed form; every chunking strategy is an approximation.

The core tradeoff: small chunks → high precision, low recall (context stripped); large chunks → high recall, low precision (diluted relevance signal).
Exhibit 2 — Six Chunking Strategies: Mechanism, Mathematics, and Failure Mode ordered by conceptual sophistication
① Fixed-Size (Character / Token)
chunks = [D[i:i+L] for i in range(0, len(D), L-overlap)]
overlap = 0.1L to 0.2L (prevents boundary splits)
Simplest strategy. Deterministic. No semantic awareness.
Overlap parameter prevents information loss at boundaries — critical for mid-sentence splits.
Failure mode: splits sentences, paragraphs, tables at arbitrary positions. Resulting chunks lack coherent meaning. Embeddings of incomplete sentences have poor quality.
When to use: homogeneous documents with uniform information density (legal clauses, standardised reports). Never use for prose or technical documentation.
complexity: trivial
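The list-comprehension above can be made concrete. A minimal runnable sketch of the fixed-size strategy (the 10% overlap default is illustrative):

```python
def fixed_size_chunks(text, size=512, overlap=64):
    """Split text into fixed-size character chunks with overlap.

    The overlap (10-20% of size) duplicates the tail of each chunk at
    the head of the next, so a sentence cut at a boundary still
    appears whole in at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1000))
chunks = fixed_size_chunks(doc, size=400, overlap=40)
# the tail of each chunk reappears at the head of the next
assert chunks[0][-40:] == chunks[1][:40]
```

The same function works on token lists instead of strings, since it only relies on slicing.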
② Sentence / Paragraph Boundary
splits = [". ", "? ", "! ", "\n\n"]
chunks = split_on_boundaries(D, splits, max_tokens=L)
merge small chunks: if len(c) < L_min → merge with next
Respects linguistic structure. Clean semantic units.
Hard minimum L_min prevents degenerate 3-token chunks (e.g., “Yes.” or “No.”).
Failure mode: sentence boundaries do not align with topic boundaries. A single paragraph can discuss two distinct concepts — splitting by sentence produces chunks that are topically coherent within but not across boundaries.
When to use: Q&A corpora, FAQ documents, content where sentences are naturally self-contained answers. Good default for most use cases.
complexity: low
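A runnable sketch of strategy ② — greedy sentence packing with the L_min merge rule. The regex boundary detection and word-count token proxy are simplifications:

```python
import re

def sentence_chunks(text, max_tokens=512, min_tokens=8):
    """Pack sentences into chunks of at most max_tokens words; any
    chunk shorter than min_tokens is merged into the NEXT chunk, so
    degenerate chunks like "Yes." never reach the index."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+|\n\n+', text.strip()) if s]
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    # merge degenerate short chunks forward
    merged, carry = [], ""
    for c in chunks:
        c = (carry + " " + c) if carry else c
        if len(c.split()) < min_tokens:
            carry = c
        else:
            merged.append(c)
            carry = ""
    if carry:
        merged.append(carry)
    return merged

text = "Yes. The quarterly report shows revenue growth across all regions this year."
chunks = sentence_chunks(text, max_tokens=8, min_tokens=3)
```

Note that a single sentence longer than max_tokens is not split further here; a production splitter would fall back to a finer separator.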
③ Recursive / Hierarchical
separators = ["\n\n", "\n", ". ", " ", ""]
result = recursive_split(D, separators, max_tokens=L)
if len(chunk) > L: recurse with next separator
Tries paragraph → sentence → word splits in order — preserves as much structure as possible given token budget.
Failure mode: still character/token-driven at leaf level. Produces variable-length chunks which complicate batching and embedding. Does not detect semantic topic shifts.
When to use: general documents where you know nothing about structure. Good baseline before semantic chunking.
complexity: low
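A minimal sketch of strategy ③. For clarity it drops separators and omits the merge-back step that production splitters (e.g. LangChain's RecursiveCharacterTextSplitter) add:

```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " "), max_len=200):
    """Recursively split text: try the coarsest separator first; any
    piece still over max_len is re-split with the next separator.
    Falls back to hard character slicing when separators run out."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    if len(pieces) == 1:           # separator absent at this level: go deeper
        return recursive_split(text, rest, max_len)
    out = []
    for p in pieces:
        out.extend(recursive_split(p, rest, max_len) if len(p) > max_len else [p])
    return out

text = ("alpha beta gamma. " * 6).strip() + "\n\n" + ("delta epsilon zeta. " * 6).strip()
parts = recursive_split(text, max_len=40)
```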
④ Semantic Chunking
E = [embed(sᵢ) for sᵢ in sentences(D)]
for i in range(w, len(E)):
sim[i] = cosine(mean(E[i-w:i]), E[i]) # w = smoothing window
split at i where sim[i] < threshold 𝜏
Embeds sentences, computes rolling cosine similarity, splits where similarity drops below threshold 𝜏.
Window w (1–3) smooths local variation — single-sentence dips don’t cause false splits.
Failure mode: requires per-document embedding pass — slow for large corpora. Threshold 𝜏 is sensitive: too low → few chunks, too high → fragmented. Gradual topic shifts (common in dense technical text) are missed.
Mathematics: percentile threshold more stable than absolute: 𝜏 = percentile(sim, 25) — split at lowest 25% of similarity scores.
complexity: moderate
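A self-contained sketch of strategy ④ with the percentile threshold. Bag-of-words counts stand in for real sentence embeddings, which is enough to show the mechanism:

```python
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, pct=25):
    """Split where adjacent-sentence similarity falls below the given
    percentile of all adjacent similarities (a percentile threshold is
    more stable than an absolute tau)."""
    sims = [cosine(embed(sentences[i - 1]), embed(sentences[i]))
            for i in range(1, len(sentences))]
    tau = sorted(sims)[max(0, int(len(sims) * pct / 100) - 1)]
    chunks, current = [], [sentences[0]]
    for i, s in enumerate(sims):
        if s <= tau:
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks

bow = lambda s: Counter(s.lower().split())   # toy stand-in for embed()
sents = ["cats purr softly", "cats sleep all day", "cats chase mice",
         "interest rates rose", "rates affect bond prices"]
topic_chunks = semantic_chunks(sents, bow)
```

With these toy embeddings the similarity drops to zero exactly at the topic shift, so the split lands between the cat sentences and the finance sentences.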
⑤ Proposition Chunking
propositions = LLM_extract(chunk,
prompt="Extract atomic, self-contained facts.")
# each proposition is a standalone, verifiable claim
# e.g. "Paris is the capital of France."
Uses an LLM to rewrite chunks as a list of atomic, self-contained propositions (Chen et al., 2023 — DenseX Retrieval).
Each proposition is: (a) factual, (b) complete without external context, (c) as short as possible.
Why it works: propositions align perfectly with the typical query structure — a question seeking a specific fact. Cosine similarity between question and proposition embeddings is maximised because neither has extraneous words.
Failure mode: expensive — requires one LLM call per chunk. Misses relational and procedural knowledge that cannot be atomised. Long documents with 1000s of propositions → large index, high latency.
complexity: high — LLM call per chunk
⑥ Late Chunking (JinaAI, 2024)
# Standard: chunk THEN embed
e(cᵢ) = mean_pool(encoder(cᵢ)) ← no context
# Late chunking: embed THEN chunk (JinaAI)
H = encoder(D) # full document token embeddings
e(cᵢ) = mean_pool(H[start_i : end_i]) ← full context
Full mathematical derivation: §5.
Key insight: chunk boundaries applied after the transformer attention pass — every token attends to the full document before pooling.
Requires: a long-context embedding model (jina-embeddings-v2, e5-mistral, Voyage-2). Document must fit in model’s context window.
complexity: high — full-doc context required
Chunking strategy selection decision tree: (1) Is the document structured (headers, sections)? → Use structure-aware splitting first, then recursive within sections. (2) Is the index small enough to afford LLM preprocessing? → Proposition chunking for fact-heavy corpora. (3) Does the embedding model support long context? → Late chunking for the best context preservation. (4) General production default: recursive sentence splitting at 512 tokens with 10–15% overlap + semantic coherence check.
§ 3
Embedding Geometry and Retrieval Scoring
Cosine similarity, dot product, the hubness problem, and what contextual embeddings fix
Dense retrieval reduces to nearest-neighbor search in a high-dimensional space. The choice of similarity function, the geometry of the embedding space, and the normalisation of vectors all affect retrieval accuracy in ways that are mathematically precise — and frequently misunderstood.
Similarity Functions — when each is correct

Cosine similarity: cos(q, c) = (q · c) / (||q|| ||c||) ∈ [−1, 1]
    Measures angular similarity — invariant to vector magnitude.
    Correct when: magnitude is not informative (most text embeddings).

Dot product: dot(q, c) = q · c = ||q|| ||c|| cos(θ)   (no normalisation)
    Sensitive to magnitude; biases toward high-norm vectors.
    Correct when: magnitude encodes importance (learned dense retrieval: DPR, ColBERT).
    OpenAI text-embedding-ada-002 is L2-normalised → dot product == cosine.

Euclidean (L2) distance: d(q, c) = ||q − c||₂ = √(2 − 2·cos(q, c)) for unit vectors
    Equivalent to cosine for normalised vectors. Used by the FAISS L2 index.
    Implementation note: maximising cosine ↔ minimising L2 for unit vectors.
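The three equivalences above are directly checkable. A short numeric verification for unit vectors:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))
def cosine(a, b): return dot(a, b) / (norm(a) * norm(b))
def l2(a, b): return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = [0.6, 0.8]   # unit vector
c = [1.0, 0.0]   # unit vector
# for unit vectors: dot == cosine, and ||q-c|| == sqrt(2 - 2*cos(q,c))
assert abs(dot(q, c) - cosine(q, c)) < 1e-12
assert abs(l2(q, c) - math.sqrt(2 - 2 * cosine(q, c))) < 1e-12
```

This is why a FAISS L2 index over normalised embeddings returns the same ranking as cosine similarity.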
The hubness problem. In high-dimensional spaces (d ≥ 100), certain vectors become “hubs” — they appear as the nearest neighbor to an anomalously large fraction of query vectors regardless of semantic content. This occurs because in high dimensions, all points concentrate near a thin shell at distance √d from the origin, and the variance of inter-point distances collapses. Hub vectors are retrieved frequently; peripheral vectors almost never, even when they are the correct answer. This is a fundamental geometric property of the embedding space, not a model deficiency. Remediation: reduce embedding dimension or re-rank to de-weight known hubs.
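The distance-concentration effect behind hubness can be observed directly: as dimension grows, the relative spread of pairwise distances between random points collapses. A small seeded simulation (sample sizes are arbitrary):

```python
import math
import random

random.seed(0)

def relative_spread(d, n=200):
    """Std/mean of pairwise distances between n random Gaussian
    points in dimension d (subsampled pairs for speed)."""
    pts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(0, n, 7) for j in range(i + 1, n, 7)]
    mean = sum(dists) / len(dists)
    var = sum((x - mean) ** 2 for x in dists) / len(dists)
    return math.sqrt(var) / mean

# concentration: relative spread shrinks as dimension grows
assert relative_spread(1536) < relative_spread(8)
```

At embedding dimensions like 1536, nearly all inter-point distances are almost identical, so small biases (e.g. vector norm) are enough to make a few points the nearest neighbor of many queries.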
Approximate Nearest Neighbor — HNSW and IVF-PQ tradeoffs

Exact search: O(n · d) per query — infeasible at n > 10⁶.

HNSW (Hierarchical Navigable Small World):
    Build: O(n log n) | Query: O(log n) | Recall@10: 0.98+
    ef_construction (build quality) and ef_search (query quality) trade off speed vs. recall.
    In-memory graph. Best for n < 10⁷. Used by Weaviate, Qdrant, pgvector.

IVF-PQ (Inverted File Index + Product Quantisation):
    Build: k-means centroids (nlist clusters) + encode vectors as product codes.
    Query: O(d · nprobe), where nprobe = number of cells to search.
    Memory: 8–16 bytes per vector vs. 4d bytes exact (d=1536 → 6144 bytes exact vs. ~16 bytes).
    PQ compression loses recall: at nprobe=50, recall@10 ≈ 0.92 for d=1536.
    Best for n > 10⁷, where HNSW memory is prohibitive.
§ 4
Contextual Retrieval (Anthropic, 2024)
How prepending document context to chunks changes the embedding distribution and why it works
Standard RAG embeds each chunk in isolation. This creates a fundamental problem: a chunk like “The revenue increased by 12% in Q3” contains no information about which company, which year, or which revenue line. Its embedding is maximally ambiguous. When retrieved for a query about Apple’s Q3 performance, it may rank below irrelevant chunks that happen to mention Apple explicitly.
Anthropic’s Contextual Retrieval (September 2024) addresses this by prepending a brief, document-level context to each chunk before embedding. The context is generated by an LLM using the full document as input.
Contextual Retrieval — the mechanism and embedding effect

Standard RAG:
    e_standard(cᵢ) = encoder(cᵢ)   ← chunk only, no document context

Contextual Retrieval:
    ctxᵢ = LLM(document=D, chunk=cᵢ, prompt="Describe where this chunk fits in the document.")
    c_contextual_i = ctxᵢ + "\n\n" + cᵢ
    e_contextual(cᵢ) = encoder(c_contextual_i)   ← context-enriched embedding

ctxᵢ is typically 50–100 tokens. It is prepended, not appended — transformer attention weights the beginning of the sequence more reliably.

Why the embedding changes:
    e_standard("Revenue increased 12% in Q3") ≈ e("generic financial metric")
    ctx = "This chunk is from Apple's 2024 annual report, Q3 results section."
    e_contextual(ctx + chunk) ≈ e("Apple Q3 2024 revenue growth")   ← anchored

Formally: cos(e_contextual(cᵢ), e(q_specific)) >> cos(e_standard(cᵢ), e(q_specific)) when q_specific = "Apple Q3 revenue 2024" — the contextual tokens pull the embedding toward the query cluster, not away from it.
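The indexing-time transformation can be sketched in a few lines. The prompt wording is loosely modeled on Anthropic's published approach, and `generate_context` is a hypothetical stand-in for the actual LLM call (e.g. a Claude Haiku request with the document prefix prompt-cached):

```python
def contextualize_chunk(document, chunk, generate_context):
    """Prepend an LLM-written situating context to a chunk before
    embedding. `generate_context` is a placeholder callable, not a
    real API."""
    prompt = (
        "<document>\n" + document + "\n</document>\n"
        "Here is the chunk we want to situate within the whole document:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Give a short, succinct context situating this chunk within the "
        "overall document, to improve search retrieval of the chunk."
    )
    ctx = generate_context(prompt)
    return ctx + "\n\n" + chunk   # context is PREPENDED, never appended

fake_llm = lambda p: "From Apple's 2024 annual report, Q3 results section."
enriched = contextualize_chunk("...full 10-K text...",
                               "The revenue increased by 12% in Q3.",
                               fake_llm)
```

The enriched string — not the bare chunk — is what gets embedded and BM25-indexed; the original chunk text is still what gets returned at query time.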
Exhibit 3 — Contextual Retrieval: Embedding Space Before and After Context Injection geometric effect on chunk placement
Anthropic’s published results: contextual embeddings alone reduce the top-20 retrieval failure rate by 35%; combined with contextual BM25, the reduction reaches 49%; adding a re-ranking stage brings it to 67% (all vs. a standard BM25+dense baseline). The LLM cost for context generation is the main drawback — typically one Claude Haiku call per chunk. Prompt caching amortises this: cache the document-level prompt across all chunk calls. With caching, the marginal cost per chunk is approximately 1K input tokens (the chunk itself) at cache-hit pricing.
Prompt caching economics for Contextual RAG: For a 200-page document split into 400 chunks, context generation without caching requires 400 full document passes. With prompt caching (cache the document prefix), each call costs only the marginal tokens for that chunk — roughly 1/200th the cost. At Claude Haiku pricing, a 200-page document enrichment costs approximately $0.02–0.05 with caching vs. $4–8 without. Contextual RAG is only economically viable with prompt caching enabled.
§ 5
Late Chunking (JinaAI, 2024)
Pre-chunk vs. post-chunk pooling — the mathematical difference and when it dominates
Late Chunking is a fundamentally different approach to the context problem. Rather than enriching chunks with text before embedding, it changes when chunking occurs relative to the embedding computation. The insight is that transformer attention is context-dependent — every token embedding encodes information from its neighbors. Standard chunking discards this context before it reaches the pooling step.
Late Chunking — formal derivation of the pre/post-chunk pooling difference

Standard chunking (chunk BEFORE embed):
    cᵢ = tokens[start_i : end_i]      ← isolate chunk before encoding
    Hᵢ = transformer(cᵢ)              ← attention within chunk only
    e(cᵢ) = mean_pool(Hᵢ) ∈ ℝ^d       ← token embeddings have NO document context

Problem: "The CEO said it would increase profits." The pronoun "it" is ambiguous within the chunk — its referent is in a previous chunk. Hᵢ cannot resolve the coreference. The embedding is semantically underspecified.

Late chunking (embed BEFORE chunk — JinaAI 2024):
    T = tokens(D)                         ← full document token sequence
    H = transformer(T)                    ← attention across the ENTIRE document
    e(cᵢ) = mean_pool(H[start_i : end_i]) ← slice embeddings AFTER global attention

Now H[j] for token j encodes the token itself AND its full document context. In "The CEO said it would increase profits.", "it" is resolved because H[it_position] attended to the referent "acquisition" in the prior sentence.

Key requirement: |T| ≤ L_model (the document must fit in the model's context window).
    → Requires a long-context embedding model that exposes token-level embeddings: jina-embeddings-v2 (8192 tokens), e5-mistral-7b-instruct (32768 tokens), Voyage-2, Cohere-embed-v3.
    → Not possible with OpenAI text-embedding-ada-002: the API returns only a single pooled vector, never per-token embeddings.
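The pooling step itself is just slicing and averaging. A toy sketch in which a 6-token, 2-dimensional matrix stands in for the output of one long-context encoder pass:

```python
def mean_pool(vectors):
    """Average a list of equal-length token embedding vectors."""
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def late_chunk_embeddings(token_embeddings, boundaries):
    """Late chunking: pool contiguous slices of the FULL-document
    token embeddings. `token_embeddings` is the output of a single
    encoder pass over the whole document (toy values here);
    `boundaries` are (start, end) token spans per chunk."""
    return [mean_pool(token_embeddings[s:e]) for s, e in boundaries]

# stand-in for H = transformer(D): 6 tokens, 2-dim embeddings
H = [[1, 0], [1, 0], [0, 1], [0, 1], [2, 2], [2, 2]]
chunk_vecs = late_chunk_embeddings(H, [(0, 2), (2, 4), (4, 6)])
```

The only difference from standard chunking is that H was computed once over the whole document, so each sliced token vector already carries cross-chunk context.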
Exhibit 4 — Late Chunking vs. Standard: Attention Patterns and Pooling Difference what changes mathematically
Late chunking has an additional efficiency advantage: one encoder call per document instead of one per chunk — for a 400-chunk document, 400× fewer encoder invocations (total compute does not fall by the same factor, since the single long-context pass pays quadratic attention over the whole document). The tradeoff: it requires a long-context embedding model, and the document must fit the model’s context window. For documents longer than 8K–32K tokens, late chunking needs an overlapping-window approach. JinaAI reports 10–20% retrieval improvement over standard chunking on multi-section technical documents.
Contextual RAG vs. Late Chunking — which to use. These solve the same problem through different mechanisms and are complementary. Contextual RAG (text prepend) works with any embedding model including short-context ones. Late chunking requires a long-context embedding model. For documents under 8K tokens, both apply: late chunking captures internal coreferences; contextual RAG adds document-level metadata the model hasn’t seen. For documents over 32K tokens, contextual RAG with windowed late chunking is the most robust approach.
§ 6
Hybrid Retrieval
BM25 + dense fusion — complementary failure modes and Reciprocal Rank Fusion
Dense retrieval (vector similarity) and sparse retrieval (BM25) have complementary failure modes. Dense retrieval captures semantic meaning but misses exact keyword matches — query “RFC 7230” will fail to retrieve chunks containing “RFC 7230” if the embedding space has not seen the term. BM25 captures exact terms but misses synonyms and paraphrases — query “heart attack” misses chunks containing “myocardial infarction”. Hybrid search uses both and fuses the ranked lists.
BM25 — the complete formula and its parameters

BM25(q, d) = Σ_{t∈q} IDF(t) · TF(t,d) · (k₁+1) / (TF(t,d) + k₁·(1 − b + b·|d|/avgdl))

IDF(t) = log((N − df(t) + 0.5) / (df(t) + 0.5) + 1)
    N = total docs, df(t) = docs containing term t
TF(t,d) = raw term frequency of t in document d
|d| = document length in terms; avgdl = average document length in corpus
k₁ ∈ [1.2, 2.0] = term-frequency saturation (default 1.5)
    high k₁ → TF more influential; low k₁ → TF saturates quickly
b ∈ [0, 1] = length normalisation (default 0.75)
    b=1 → full length normalisation; b=0 → no normalisation

BM25 advantages: exact term match, no hallucinated semantic similarity, zero-shot (no training needed), fast (inverted index, O(|q| · avg_df)).
BM25 failures: out-of-vocabulary queries, synonyms, multi-word concepts, semantically rich queries where exact terms are absent from relevant docs.
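As a concrete check of the formula, a minimal pure-Python BM25 over pre-tokenised documents (the toy corpus is illustrative; production systems use an inverted index rather than scoring every document):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 matching the formula above (IDF with +1 inside the
    log, so scores stay non-negative). `docs` is a list of token
    lists; returns one score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["rfc", "7230", "defines", "http", "message", "syntax"],
        ["http", "caching", "is", "specified", "elsewhere"],
        ["unrelated", "cooking", "recipe"]]
scores = bm25_scores(["rfc", "7230"], docs)
# only the document containing the exact terms scores above zero
```

This is exactly the exact-term behaviour dense retrieval lacks: “rfc 7230” matches or it does not; there is no soft semantic credit.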
Reciprocal Rank Fusion — the fusion function and why 60 is the magic constant

RRF(d, {R₁, R₂, …, Rₖ}) = Σᵢ 1 / (k_rrf + rankᵢ(d))

k_rrf = 60 (empirically near-optimal across many datasets, Cormack et al. 2009)
rankᵢ(d) = rank of document d in ranked list Rᵢ (1-indexed)
If d is not in Rᵢ, it is typically treated as rank → ∞, contributing 0.

For two rankers (dense + BM25):
    RRF(d) = 1/(60 + rank_dense(d)) + 1/(60 + rank_BM25(d))

Why k_rrf = 60 works: it down-weights rank differences at the top.
    rank 1 contributes 1/61 ≈ 0.0164
    rank 2 contributes 1/62 ≈ 0.0161 (only 2% less than rank 1)
    rank 10 contributes 1/70 ≈ 0.0143 (13% less than rank 1)
A document ranked 3rd in both lists (2/63 ≈ 0.0317) outranks one ranked 1st in BM25 but 10th in dense (1/61 + 1/70 ≈ 0.0307) — agreement between the rankers is rewarded. RRF is parameter-free and requires no score normalisation across rankers.
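RRF is a few lines of code, and the agreement-bonus arithmetic can be verified directly:

```python
def rrf(rankings, k_rrf=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.
    A document absent from a list contributes nothing for that list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k_rrf + rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

fused = rrf([["a", "b"], ["a", "c"]])
# "a" appears at rank 1 in both lists, so it dominates the fusion

# agreement bonus: 3rd in BOTH lists beats 1st in one + 10th in the other
both_3rd = 2 / (60 + 3)                 # ~0.0317
split = 1 / (60 + 1) + 1 / (60 + 10)    # ~0.0307
assert both_3rd > split
```

Because only ranks enter the sum, the raw BM25 scores and cosine similarities never need to be put on a common scale.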
Exhibit 5 — Hybrid Retrieval Pipeline: BM25 + Dense → RRF Fusion full scoring flow with example
RRF is robust to differences in score scales between BM25 and cosine similarity — no normalisation is needed because it uses only ranks, not raw scores. The weight on each ranker is implicit in which results appear: if only dense returns d₉, its RRF score is 1/(60 + rank_dense). Explicit weighting variants exist: weighted RRF = α/(60 + rank₁) + β/(60 + rank₂). For most production systems, α = β = 1 (equal weight) performs comparably to tuned weights. A complementary query-side technique is HyDE: instead of embedding the query q, generate a hypothetical answer document with an LLM and embed that — the hypothetical answer is semantically closer to the relevant chunks than the short query string.
§ 7
Re-ranking
Cross-encoder vs. bi-encoder — the precision gap, cost model, and when re-ranking pays
The retrieval stage uses a bi-encoder: query and document are encoded independently, then compared by dot product. This is fast — embeddings can be pre-computed — but imprecise, because the encoding of the query and document never interact. Re-ranking uses a cross-encoder: query and document are fed jointly to a transformer, enabling full attention between them. This produces much better relevance scores but cannot be pre-computed.
Bi-encoder vs. Cross-encoder — the scoring difference

Bi-encoder (retrieval):
    score_bi(q, c) = e(q) · e(c)   ← embeddings computed independently
    Complexity: O(1) at query time (e(c) pre-computed offline)
    Interaction: NONE — q and c never attend to each other

Cross-encoder (re-ranking):
    score_cross(q, c) = fθ([CLS] q [SEP] c [SEP]) → scalar
    Complexity: O((|q| + |c|)²) at query time (no pre-computation possible)
    Interaction: FULL — every token of q attends to every token of c

Cross-encoder advantages over bi-encoder on the BEIR benchmark:
    NDCG@10 improvement: +8 to +15 points depending on task
    Best on: exact answer matching, multi-hop reasoning, long documents
    Smallest gap: keyword lookup, code search (bi-encoder near-optimal)
Re-ranking Cost Model — when the latency is worth it

Retrieval latency T_retrieve = O(log n) (ANN search, negligible)
Re-ranking latency T_rerank = k_retrieve × T_cross_encode
    T_cross_encode ≈ 5–50 ms per (q, c) pair on GPU, depending on chunk length
    k_retrieve = 50–200 candidates from retrieval
    → T_rerank ≈ 250 ms–10 s (often the dominant latency term in the pipeline)

Optimisation: retrieve k_large (top-200), re-rank, pass top-n to the LLM (n ≪ k_large).
Parallelism: batch cross-encoder calls — GPU throughput reduces effective latency.
Break-even: re-ranking pays when the precision improvement reduces LLM generation errors by more than the latency and cost added. On long-context LLMs where context window cost dominates, reducing n via better ranking is worth significant latency.
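The batching point can be made concrete with a crude latency model (all numbers are illustrative assumptions, not measurements):

```python
import math

def rerank_latency_ms(k_retrieve, t_batch_ms, batch_size=32):
    """Crude re-ranking latency estimate: k_retrieve (query, chunk)
    pairs scored in GPU batches of batch_size, at t_batch_ms per
    batch. Assumes batching amortises per-pair cost to per-batch
    cost, which holds until the GPU saturates."""
    batches = math.ceil(k_retrieve / batch_size)
    return batches * t_batch_ms

# 200 candidates, 30 ms per batch of 32 → 7 batches → 210 ms
assert rerank_latency_ms(200, 30) == 210
```

Without batching, the same 200 candidates at 30 ms each would cost 6 s — the difference between a usable pipeline and an unusable one.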
Cross-encoder models
Cohere Rerank-3 (API): state-of-the-art, 4096-token context, multilingual — best for production when the latency budget allows.
BGE-Reranker-v2-m3 (open): comparable quality, self-hostable, 8192-token context.
cross-encoder/ms-marco-MiniLM-L-6-v2: smallest and fastest — good for latency-sensitive applications.
LLM-as-reranker: instruct an LLM to score relevance — highest quality but expensive.
SetRank, RankGPT: listwise re-ranking (score all k candidates jointly) vs. pointwise (score each independently). Listwise better captures relative ordering; pointwise parallelises.
NDCG@k — the correct retrieval metric
NDCG@k (Normalised Discounted Cumulative Gain) is the standard measure for ranked retrieval quality: NDCG@k = DCG@k / IDCG@k where DCG@k = Σᵢ (2^relᵢ - 1) / log₂(i+1). Discounts lower-ranked results logarithmically. IDCG = DCG of perfect ordering. NDCG rewards: (a) relevant documents appearing early and (b) higher-graded relevance. For RAG, use Recall@k (is the correct chunk in the top k?) rather than NDCG when there is a single gold chunk, and NDCG when relevance is graded.
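The metric above is short enough to implement and sanity-check directly:

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for graded relevance, with gains (2^rel - 1) and a
    log2(position + 1) discount. `relevances` is in ranked order
    (position 1 first)."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)   # i is 0-based
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# a perfect ordering scores exactly 1.0
assert ndcg_at_k([3, 2, 0], k=3) == 1.0
```

Swapping a relevant document into a later position lowers the score, which is precisely the behaviour Recall@k cannot express.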
§ 8
Context Window Assembly
Token budgets, positional degradation, and U-shape packing
After retrieval and re-ranking, the top-n chunks must be assembled into the prompt. This assembly step has its own failure mode: LLM performance degrades as a function of where in the context window information is placed. The “lost-in-the-middle” effect (Liu et al., 2023) is a systematic positional bias in current transformer architectures.
Lost-in-the-Middle — positional degradation model

P_recall(chunk at position i | n total chunks) ≈ f(i, n)

Empirical shape (Liu et al. 2023, on GPT-3.5, Claude, LLaMA):
    f(1) ≈ 0.92      ← first chunk: high recall
    f(n) ≈ 0.88      ← last chunk: high recall
    f(n/2) ≈ 0.56    ← middle chunk: recall drops by ~35 pp

U-shaped recall curve: primacy + recency effects dominate middle positions. The effect size scales with n and with context length: more chunks → deeper trough.

Mitigation strategies:
    1. Place the highest-relevance chunks at positions 1 and n (first + last).
    2. Reduce n: fewer, higher-quality chunks beat many diluted chunks.
    3. Use models with explicit position handling (LongRoPE, ALiBi) — smaller effect.
    4. Repeat critical information: cite key facts in the query framing AND the context.
Exhibit 6 — Context Window Assembly: Token Budget, Deduplication, and Positional Strategy full assembly algorithm
U-shape packing exploits the primacy and recency effect. The two most relevant chunks are placed at positions 1 and n (highest recall). The third most relevant is placed at position 2 (second-best recall), the fourth at position n-1, and so on. This is a deterministic, zero-cost optimisation that increases the expected recall of high-relevance chunks without any retrieval changes. The effect size is largest for n > 10 chunks and for long-context windows where the middle is far from both ends.
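The packing rule described above is a small deterministic function:

```python
def u_shape_pack(chunks_by_relevance):
    """Place chunks ranked by relevance (best first) in U-shaped
    order: rank 1 -> position 1, rank 2 -> position n, rank 3 ->
    position 2, rank 4 -> position n-1, ... so the strongest chunks
    sit at the high-recall ends and the weakest sink to the middle."""
    n = len(chunks_by_relevance)
    out = [None] * n
    lo, hi = 0, n - 1
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            out[lo] = chunk
            lo += 1
        else:
            out[hi] = chunk
            hi -= 1
    return out

order = u_shape_pack(["r1", "r2", "r3", "r4", "r5"])
# → ["r1", "r3", "r5", "r4", "r2"]
```

Applied after re-ranking and before prompt construction, it costs nothing and changes only the order in which chunks are concatenated.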
§ 9
Failure Mode Taxonomy
Complete classification of what breaks, where, and how to diagnose each failure class
Exhibit 7 — Failure Mode Taxonomy stage, mechanism, symptom, and remediation per class
Boundary fragmentation
Chunking
Fixed-size split cuts sentences mid-way. The embedding of a fragment is incoherent.
High embedding variance within the same document. Chunks with dangling pronouns or incomplete sentences.
Switch to sentence-boundary or semantic chunking. Add overlap (15%). Minimum chunk length filter.
Semantic ambiguity
Embedding
Chunk lacks entity context — “it increased by 12%” embeds as generic financial statement, not Apple Q3 specific.
Correct chunk retrieved for broad queries but not specific ones. Recall drops with increasing query specificity.
Contextual Retrieval (§4). Add entity/section metadata to chunk text before embedding.
Coreference failure
Embedding
“They decided to acquire it” — pronouns unresolved because referents in prior chunk, stripped at chunking.
Correct chunk retrieved for standalone fact queries; fails when query depends on pronoun in chunk.
Late chunking (§5). Increase overlap. Coreference resolution as preprocessing step.
Lexical mismatch
Retrieval
Dense embedding maps “heart attack” and “myocardial infarction” differently — vocabulary gap. Or rare acronym not in training data.
High recall on paraphrase queries, low recall on exact-term queries. BM25 recovers what dense misses.
Add BM25 to hybrid retrieval with RRF fusion (§6). Query expansion with synonyms.
Hubness bias
Retrieval
Certain high-norm vectors retrieved for almost all queries regardless of relevance. High-dimensional geometric effect.
Same k documents appearing in top-10 across semantically diverse queries. Check retrieval frequency histogram.
L2 normalise all embeddings. Re-rank to down-weight known hubs. Reduce embedding dimension.
Semantic dilution
Retrieval
Multi-topic chunks embed to centroid — no single query is close to a chunk discussing two unrelated concepts.
Relevant information known to exist in corpus but never retrieved. Manual inspection shows chunks are multi-topic.
Reduce chunk size. Use proposition chunking to atomise multi-topic chunks. Semantic chunking to split at topic boundaries.
Re-ranker distribution shift
Re-ranking
Cross-encoder trained on web (MS-MARCO) but corpus is domain-specific (medical, legal, code). Score scale invalid.
Re-ranking hurts metrics vs. raw retrieval on domain corpus. Cross-encoder scores low-variance (all chunks near same score).
Fine-tune cross-encoder on domain data. Use domain-specific re-ranker (Cohere Rerank-3 multilingual handles this better).
Lost in the middle
Assembly
LLM fails to use information at middle positions in long context. Primacy/recency bias.
Correct chunk retrieved and confirmed in context. LLM answer ignores it. Position of correct chunk is at n/2.
U-shape packing (§8). Reduce n. Repeat key information in query framing. Use models with better long-context attention.
Context contamination
Assembly
Retrieved chunks from different sources contradict each other. LLM averages or hallucinates between them.
LLM gives hedged, contradictory, or wrong answers on factual queries. Multiple sources give conflicting information.
Include source metadata; instruct LLM to cite and prefer most recent / most authoritative. Deduplicate contradictory chunks by date/authority.
Generation hallucination
Generation
LLM generates information not present in context, despite correct retrieval. Model prior overrides retrieved context.
Correct chunks present in context. LLM answer contains facts not in any chunk.
Explicit grounding instruction: “Answer ONLY from the provided context. If not present, say so.” Reduce temperature. Add faithfulness verification step.
Diagnosis order: first check retrieval (is the correct chunk in top-k?), then check context (is it present in the assembled prompt?), then check generation (did the LLM use it?). Most production failures are retrieval failures — the LLM is rarely the bottleneck. A systematic evaluation requires: ground-truth (query, answer, source chunk) triples; Recall@k for retrieval; answer correctness for generation; and attributability (can every claim in the answer be traced to a retrieved chunk).
Exhibit 8 — Decision Matrix: Which RAG Enhancement for Which Problem systematic selection guide
Symptom · Diagnosis · Primary fix · Secondary fix · Cost
Correct chunks not retrieved at all · Lexical mismatch or embedding not trained on domain vocabulary
Maximal Marginal Relevance (MMR): a diversity-aware selection strategy that picks chunks maximising both relevance and diversity. MMR(cᵢ) = λ·sim(cᵢ, q) − (1−λ)·max_{cⱼ∈S} sim(cᵢ, cⱼ), where S is the already-selected set. λ=1 → pure relevance; λ=0 → pure diversity. Useful when the corpus contains many near-duplicate chunks (e.g., a document repeated with minor edits) that would otherwise dominate the retrieved set.
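A greedy MMR selection sketch over precomputed similarities (the similarity tables and doc ids are toy values):

```python
def mmr_select(candidates, query_sim, pair_sim, k, lam=0.5):
    """Greedy MMR: repeatedly pick the candidate maximising
    lam * sim(c, q) - (1 - lam) * max over selected sim(c, s).
    `query_sim[c]` and `pair_sim[frozenset((a, b))]` hold
    precomputed similarities."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((pair_sim[frozenset((c, s))] for s in selected),
                             default=0.0)
            return lam * query_sim[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

q_sim = {"a": 0.9, "a_dup": 0.89, "b": 0.5}
p_sim = {frozenset(("a", "a_dup")): 0.99,
         frozenset(("a", "b")): 0.1,
         frozenset(("a_dup", "b")): 0.1}
picked = mmr_select(["a", "a_dup", "b"], q_sim, p_sim, k=2)
# the near-duplicate of "a" is penalised, so the diverse "b" is picked
```

With λ = 1 the function degenerates to plain top-k by relevance and would return the near-duplicate instead.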
Exhibit 9 — Complexity vs. Retrieval Improvement: Component Value Map where to invest engineering effort
Build order for production RAG: (1) Sentence/recursive chunking with overlap — immediate baseline improvement over fixed-size, zero additional infrastructure. (2) BM25 hybrid + RRF — add Elasticsearch/BM25S alongside dense, fuse with RRF. Largest single retrieval improvement per unit effort. (3) Contextual Retrieval — add LLM context prepend during indexing with prompt caching; 49% failure reduction. (4) Cross-encoder re-ranking — add Cohere Rerank or BGE-Reranker for precision-critical use cases. (5) Late chunking or proposition chunking for specific document types where coreference or atomicity matters. Fine-tuned embeddings only when domain vocabulary is highly specialised and sufficient labeled data exists.