Retrieval Strategies (Beyond Top‑K)¶
Most RAG systems start with:
1) embed the query
2) retrieve top‑K chunks by vector similarity
3) paste those chunks into a prompt
That works, but “top‑K” alone often fails in predictable ways:

- you get many chunks from the same doc (low diversity)
- you miss exact keywords (IDs, error codes)
- you retrieve near-duplicates (wasted context)
- you retrieve the right chunks but in the wrong order or missing context
This page gives a practical “retrieval ladder” you can climb as your system grows.
The retrieval ladder¶
1) Top‑K dense retrieval (baseline)
2) Metadata filters (scope the search)
3) Diversity (MMR / per-doc caps) (reduce duplicates)
4) Hybrid retrieval (FTS + vectors) (catch exact keywords)
5) Reranking (sort candidates by a stronger model)
6) Query rewriting / multi-query (improve recall)
You don’t need all of them. Add the next rung only when you see a clear failure mode.
1) Top‑K dense retrieval¶
Baseline pgvector query:
```sql
SELECT id, doc_id, section_path, content
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT 8;
```
Good defaults
- start with k=6–10
- keep chunks ~200–400 tokens (see: Chunking)
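The `<=>` operator above is pgvector's cosine-distance operator. To make what the query computes concrete, here is a minimal pure-Python sketch of the same top‑K selection over an in-memory list (the toy 3-dimensional embeddings are illustrative only; real embeddings have hundreds of dimensions and live in the database):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, matching pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 8) -> list[str]:
    """chunks: (id, embedding) pairs. Returns the ids of the k nearest chunks."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c[1]))
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy example: "a" matches the query exactly, "c" is close, "b" is orthogonal
chunks = [("a", [1.0, 0.0, 0.0]), ("b", [0.0, 1.0, 0.0]), ("c", [0.9, 0.1, 0.0])]
print(top_k([1.0, 0.0, 0.0], chunks, k=2))  # → ['a', 'c']
```

The database does the same ranking, just with an ANN index instead of a full sort.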
2) Metadata filters (huge win)¶
Filters improve both quality and latency because you search a smaller space.
Examples:
```sql
-- Only retrieve from a specific product
SELECT id, content
FROM chunks
WHERE metadata @> '{"product":"enterprise"}'::jsonb
ORDER BY embedding <=> $1::vector
LIMIT 8;

-- Only retrieve from a subset of sources
SELECT id, content
FROM chunks
WHERE metadata->>'source_type' = 'docs'
ORDER BY embedding <=> $1::vector
LIMIT 8;
```
3) Diversity: per-doc caps and MMR¶
Per-document cap (simple and effective)¶
If your top results come from one long document, cap results per doc_id.
One SQL approach:
```sql
WITH ranked AS (
  SELECT
    id,
    doc_id,
    content,
    row_number() OVER (PARTITION BY doc_id ORDER BY embedding <=> $1::vector) AS doc_rnk,
    embedding <=> $1::vector AS distance
  FROM chunks
)
SELECT id, doc_id, content
FROM ranked
WHERE doc_rnk <= 2
ORDER BY distance
LIMIT 8;
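If you fetch a larger candidate set into application code anyway (for example, before reranking), the same cap is a few lines of Python. A sketch, assuming candidates arrive already sorted by distance:

```python
def cap_per_doc(candidates: list[tuple[str, str]], max_per_doc: int = 2, k: int = 8) -> list[str]:
    """candidates: (chunk_id, doc_id) pairs, sorted by distance (best first).
    Keeps at most max_per_doc chunks per document, up to k total."""
    counts: dict[str, int] = {}
    kept: list[str] = []
    for chunk_id, doc_id in candidates:
        if counts.get(doc_id, 0) < max_per_doc:
            counts[doc_id] = counts.get(doc_id, 0) + 1
            kept.append(chunk_id)
        if len(kept) == k:
            break
    return kept

# docA dominates the top of the list; the cap lets docB through
candidates = [("c1", "docA"), ("c2", "docA"), ("c3", "docA"), ("c4", "docB")]
print(cap_per_doc(candidates, max_per_doc=2, k=3))  # → ['c1', 'c2', 'c4']
```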
MMR (Maximal Marginal Relevance)¶
MMR trades off:

- relevance to the query
- novelty vs already-selected chunks
MMR is typically done in application code (not SQL). High-level pseudocode:
```text
selected = []
while len(selected) < k:
    pick the chunk that maximizes:
        lambda * sim(query, chunk) - (1 - lambda) * max(sim(chunk, s) for s in selected)
```
Start with lambda=0.7.
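A concrete version of that pseudocode, as a pure-Python sketch (here `sim` is cosine similarity, and the chunk embeddings are assumed precomputed; the 2-dimensional vectors are illustrative only):

```python
import math

def cos_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec: list[float], chunks: list[tuple[str, list[float]]],
        k: int = 8, lam: float = 0.7) -> list[str]:
    """Greedy MMR selection over (id, embedding) pairs."""
    remaining = list(chunks)
    selected: list[tuple[str, list[float]]] = []
    while remaining and len(selected) < k:
        def score(c):
            relevance = cos_sim(query_vec, c[1])
            redundancy = max((cos_sim(c[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [chunk_id for chunk_id, _ in selected]

# "a" and "b" are exact duplicates; "c" is equally relevant but novel,
# so MMR picks "c" second even though "b" ties it on raw similarity
chunks = [("a", [0.9, 0.436]), ("b", [0.9, 0.436]), ("c", [0.9, -0.436])]
print(mmr([1.0, 0.0], chunks, k=3, lam=0.7))  # → ['a', 'c', 'b']
```

With plain top‑K the duplicate `"b"` would rank second; the novelty term demotes it below `"c"`.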
4) Hybrid retrieval (dense + keyword)¶
Use hybrid retrieval when users type:

- exact names (“OAuth”, “SAML”, “HIPAA”)
- error codes (“E11000”, “403”)
- IDs (“INV-1029”)
Postgres makes hybrid retrieval easy, since full-text search and pgvector live in the same database. A strong default is RRF (reciprocal rank fusion) over the full-text and vector result lists.

See: SQL for RAG
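If you prefer to fuse in application code instead of SQL, reciprocal rank fusion is a few lines. A sketch (`k=60` is the conventional RRF constant; the id lists are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion.
    Each id scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]    # ids from the vector search
keyword = ["c7", "c3", "c2"]  # ids from full-text search
print(rrf_fuse([dense, keyword]))  # → ['c3', 'c7', 'c1', 'c2']
```

Chunks that appear in both lists (`c3`, `c7`) float to the top, which is exactly the behavior you want from hybrid retrieval.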
5) Reranking¶
Vector similarity is a fast filter, not a perfect ranker.
A common pattern:

1) retrieve top 50–200 candidates (fast)
2) rerank to top 6–10 using a stronger model (cross-encoder or LLM)
Reranking helps when:

- the correct chunk is in the candidate set but not in the top‑K
- your content is repetitive and hard to distinguish
Two-stage retrieval pattern:
```python
import cohere

# Stage 1: fetch 100 candidates with ANN (fast)
candidates = retrieve(query, k=100)

# Stage 2: rerank to top 8 with Cohere (accurate)
co = cohere.ClientV2()
results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[c["content"] for c in candidates],
    top_n=8,
)
reranked = [candidates[r.index] for r in results.results]
```
!!! tip "Full reranking tutorial"

    See Reranking Retrieved Results for local cross-encoder options, a latency comparison table, and how to measure impact using the eval harness.
6) Query rewriting / multi-query¶
If the user query is ambiguous or too short, rewrite it.
Two practical patterns:

- Rewrite: “Expand this query into a search query for our docs…”
- Multi-query: generate 3–5 variants, retrieve for each, then merge + dedupe
Only do this when you see recall issues; it increases latency and token usage.
Runnable rewrite function:
```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(original_query: str) -> str:
    """Expand an ambiguous query into a better standalone search query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a search query optimizer. "
                    "Rewrite the user's query into a clear, specific search query "
                    "for a technical documentation system. "
                    "Return only the rewritten query, nothing else."
                ),
            },
            {"role": "user", "content": original_query},
        ],
    )
    return resp.choices[0].message.content.strip()

# Example
original = "why is it slow?"
better = rewrite_query(original)
print(better)
# → "What are the common causes of slow retrieval performance in a RAG system?"
```
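For the multi-query variant, the merge-and-dedupe step is the only new code. A sketch, merging round-robin so every variant contributes its best hits first (`retrieve_fn` is a stand-in for your actual retrieval call; the toy `fake_index` below exists only to make the example runnable):

```python
def multi_query_retrieve(queries: list[str], retrieve_fn, k: int = 8) -> list[str]:
    """Retrieve for each query variant, then merge round-robin and dedupe."""
    rankings = [retrieve_fn(q) for q in queries]
    seen: set[str] = set()
    merged: list[str] = []
    # Walk rank positions: all rank-1 hits first, then rank-2, etc.
    for rank in range(max(len(r) for r in rankings)):
        for ranking in rankings:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                merged.append(ranking[rank])
    return merged[:k]

# Toy stand-in retriever: maps each query variant to a fixed ranked id list
fake_index = {
    "slow retrieval": ["c1", "c2"],
    "latency tuning": ["c2", "c3"],
}
print(multi_query_retrieve(list(fake_index), fake_index.get, k=4))  # → ['c1', 'c2', 'c3']
```

RRF fusion (see the hybrid retrieval section) is an equally reasonable way to merge the per-variant lists.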
Debug checklist (“why does retrieval look wrong?”)¶
1) Inspect the top retrieved chunks (before prompting).
2) Check if parsing added boilerplate noise.
3) Check chunk size: too small/too big?
4) Add metadata filters (product/version/source).
5) Add per-doc caps or MMR if results are duplicates.
6) Add hybrid retrieval if exact keywords are missed.
7) Consider reranking if “almost right” results keep showing up.
Next steps¶
- Make answers grounded and cite sources: Prompt Engineering for RAG
- Measure improvements objectively: Evaluating RAG