Retrieval Strategies (Beyond Top‑K)¶
Most RAG systems start with:
1) embed the query
2) retrieve top‑K chunks by vector similarity
3) paste those chunks into a prompt
That works, but “top‑K” alone often fails in predictable ways:

- you get many chunks from the same doc (low diversity)
- you miss exact keywords (IDs, error codes)
- you retrieve near-duplicates (wasted context)
- you retrieve the right chunks but in the wrong order or missing context
This page gives a practical “retrieval ladder” you can climb as your system grows.
The retrieval ladder¶
1) Top‑K dense retrieval (baseline)
2) Metadata filters (scope the search)
3) Diversity (MMR / per-doc caps) (reduce duplicates)
4) Hybrid retrieval (FTS + vectors) (catch exact keywords)
5) Reranking (sort candidates by a stronger model)
6) Query rewriting / multi-query (improve recall)
You don’t need all of them. Add the next rung only when you see a clear failure mode.
1) Top‑K dense retrieval¶
Baseline pgvector query:
```sql
SELECT id, doc_id, section_path, content
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT 8;
```
Good defaults
- start with k=6–10
- keep chunks ~200–400 tokens (see: Chunking)
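The `<=>` operator above is pgvector's cosine-distance operator. To make what the query computes concrete, here is a minimal pure-Python sketch of the same top‑K selection over an in-memory list (the toy 3-dimensional embeddings are illustrative only; real embeddings have hundreds of dimensions and live in the database):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, matching pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 8) -> list[str]:
    """chunks: (id, embedding) pairs. Returns the ids of the k nearest chunks."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c[1]))
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy example: "a" matches the query exactly, "c" is close, "b" is orthogonal
chunks = [("a", [1.0, 0.0, 0.0]), ("b", [0.0, 1.0, 0.0]), ("c", [0.9, 0.1, 0.0])]
print(top_k([1.0, 0.0, 0.0], chunks, k=2))  # → ['a', 'c']
```

The database does the same ranking, just with an ANN index instead of a full sort.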
2) Metadata filters (huge win)¶
Filters improve both quality and latency because you search a smaller space.
Examples:
```sql
-- Only retrieve from a specific product
SELECT id, content
FROM chunks
WHERE metadata @> '{"product":"enterprise"}'::jsonb
ORDER BY embedding <=> $1::vector
LIMIT 8;

-- Only retrieve from a subset of sources
SELECT id, content
FROM chunks
WHERE metadata->>'source_type' = 'docs'
ORDER BY embedding <=> $1::vector
LIMIT 8;
```
3) Diversity: per-doc caps and MMR¶
Per-document cap (simple and effective)¶
If your top results come from one long document, cap results per doc_id.
One SQL approach:
```sql
WITH ranked AS (
  SELECT
    id,
    doc_id,
    content,
    row_number() OVER (PARTITION BY doc_id ORDER BY embedding <=> $1::vector) AS doc_rnk,
    embedding <=> $1::vector AS distance
  FROM chunks
)
SELECT id, doc_id, content
FROM ranked
WHERE doc_rnk <= 2
ORDER BY distance
LIMIT 8;
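If you fetch a larger candidate set into application code anyway (for example, before reranking), the same cap is a few lines of Python. A sketch, assuming candidates arrive already sorted by distance:

```python
def cap_per_doc(candidates: list[tuple[str, str]], max_per_doc: int = 2, k: int = 8) -> list[str]:
    """candidates: (chunk_id, doc_id) pairs, sorted by distance (best first).
    Keeps at most max_per_doc chunks per document, up to k total."""
    counts: dict[str, int] = {}
    kept: list[str] = []
    for chunk_id, doc_id in candidates:
        if counts.get(doc_id, 0) < max_per_doc:
            counts[doc_id] = counts.get(doc_id, 0) + 1
            kept.append(chunk_id)
        if len(kept) == k:
            break
    return kept

# docA dominates the top of the list; the cap lets docB through
candidates = [("c1", "docA"), ("c2", "docA"), ("c3", "docA"), ("c4", "docB")]
print(cap_per_doc(candidates, max_per_doc=2, k=3))  # → ['c1', 'c2', 'c4']
```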
MMR (Maximal Marginal Relevance)¶
MMR trades off:

- relevance to the query
- novelty vs already-selected chunks
MMR is typically done in application code (not SQL). High-level pseudocode:
```text
selected = []
while len(selected) < k:
    pick the chunk that maximizes:
        lambda * sim(query, chunk) - (1 - lambda) * max(sim(chunk, s) for s in selected)
```
Start with lambda=0.7.
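A concrete version of that pseudocode, as a pure-Python sketch (here `sim` is cosine similarity, and the chunk embeddings are assumed precomputed; the 2-dimensional vectors are illustrative only):

```python
import math

def cos_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec: list[float], chunks: list[tuple[str, list[float]]],
        k: int = 8, lam: float = 0.7) -> list[str]:
    """Greedy MMR selection over (id, embedding) pairs."""
    remaining = list(chunks)
    selected: list[tuple[str, list[float]]] = []
    while remaining and len(selected) < k:
        def score(c):
            relevance = cos_sim(query_vec, c[1])
            redundancy = max((cos_sim(c[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [chunk_id for chunk_id, _ in selected]

# "a" and "b" are exact duplicates; "c" is equally relevant but novel,
# so MMR picks "c" second even though "b" ties it on raw similarity
chunks = [("a", [0.9, 0.436]), ("b", [0.9, 0.436]), ("c", [0.9, -0.436])]
print(mmr([1.0, 0.0], chunks, k=3, lam=0.7))  # → ['a', 'c', 'b']
```

With plain top‑K the duplicate `"b"` would rank second; the novelty term demotes it below `"c"`.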
4) Hybrid retrieval (dense + keyword)¶
Use hybrid retrieval when users type:

- exact names (“OAuth”, “SAML”, “HIPAA”)
- error codes (“E11000”, “403”)
- IDs (“INV-1029”)
Postgres makes hybrid retrieval easy, since full-text search and pgvector live in the same database. A strong default is RRF (reciprocal rank fusion) over the full-text and vector result lists.

See: SQL for RAG
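If you prefer to fuse in application code instead of SQL, reciprocal rank fusion is a few lines. A sketch (`k=60` is the conventional RRF constant; the id lists are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion.
    Each id scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]    # ids from the vector search
keyword = ["c7", "c3", "c2"]  # ids from full-text search
print(rrf_fuse([dense, keyword]))  # → ['c3', 'c7', 'c1', 'c2']
```

Chunks that appear in both lists (`c3`, `c7`) float to the top, which is exactly the behavior you want from hybrid retrieval.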
5) Reranking¶
Vector similarity is a fast filter, not a perfect ranker.
A common pattern:

1) retrieve top 50–200 candidates (fast)
2) rerank to top 6–10 using a stronger model (cross-encoder or LLM)
Reranking helps when:

- the correct chunk is in the candidate set but not in the top‑K
- your content is repetitive and hard to distinguish
Two-stage retrieval pattern:
```python
import cohere

# Stage 1: fetch 100 candidates with ANN (fast)
candidates = retrieve(query, k=100)

# Stage 2: rerank to top 8 with Cohere (accurate)
co = cohere.ClientV2()
results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[c["content"] for c in candidates],
    top_n=8,
)
reranked = [candidates[r.index] for r in results.results]
```
!!! tip "Full reranking tutorial"

    See Reranking Retrieved Results for local cross-encoder options, a latency comparison table, and how to measure impact using the eval harness.
6) Query rewriting / multi-query¶
If the user query is ambiguous or too short, rewrite it.
Two practical patterns:

- Rewrite: “Expand this query into a search query for our docs…”
- Multi-query: generate 3–5 variants, retrieve for each, then merge + dedupe
Only do this when you see recall issues; it increases latency and token usage.
Runnable rewrite function:
```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(original_query: str) -> str:
    """Expand an ambiguous query into a better standalone search query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a search query optimizer. "
                    "Rewrite the user's query into a clear, specific search query "
                    "for a technical documentation system. "
                    "Return only the rewritten query, nothing else."
                ),
            },
            {"role": "user", "content": original_query},
        ],
    )
    return resp.choices[0].message.content.strip()

# Example
original = "why is it slow?"
better = rewrite_query(original)
print(better)
# → "What are the common causes of slow retrieval performance in a RAG system?"
```
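For the multi-query variant, the merge-and-dedupe step is the only new code. A sketch, merging round-robin so every variant contributes its best hits first (`retrieve_fn` is a stand-in for your actual retrieval call; the toy `fake_index` below exists only to make the example runnable):

```python
def multi_query_retrieve(queries: list[str], retrieve_fn, k: int = 8) -> list[str]:
    """Retrieve for each query variant, then merge round-robin and dedupe."""
    rankings = [retrieve_fn(q) for q in queries]
    seen: set[str] = set()
    merged: list[str] = []
    # Walk rank positions: all rank-1 hits first, then rank-2, etc.
    for rank in range(max(len(r) for r in rankings)):
        for ranking in rankings:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                merged.append(ranking[rank])
    return merged[:k]

# Toy stand-in retriever: maps each query variant to a fixed ranked id list
fake_index = {
    "slow retrieval": ["c1", "c2"],
    "latency tuning": ["c2", "c3"],
}
print(multi_query_retrieve(list(fake_index), fake_index.get, k=4))  # → ['c1', 'c2', 'c3']
```

RRF fusion (see the hybrid retrieval section) is an equally reasonable way to merge the per-variant lists.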
Debug checklist (“why does retrieval look wrong?”)¶
1) Inspect the top retrieved chunks (before prompting).
2) Check if parsing added boilerplate noise.
3) Check chunk size: too small/too big?
4) Add metadata filters (product/version/source).
5) Add per-doc caps or MMR if results are duplicates.
6) Add hybrid retrieval if exact keywords are missed.
7) Consider reranking if “almost right” results keep showing up.
Next steps¶
- Make answers grounded and cite sources: Prompt Engineering for RAG
- Measure improvements objectively: Evaluating RAG