
Retrieval Strategies (Beyond Top‑K)

Most RAG systems start with:

1) embed the query
2) retrieve top‑K chunks by vector similarity
3) paste those chunks into a prompt

That works, but “top‑K” alone often fails in predictable ways:

- you get many chunks from the same doc (low diversity)
- you miss exact keywords (IDs, error codes)
- you retrieve near-duplicates (wasted context)
- you retrieve the right chunks but in the wrong order or with missing context

This page gives a practical “retrieval ladder” you can climb as your system grows.


The retrieval ladder

1) Top‑K dense retrieval (baseline)
2) Metadata filters (scope the search)
3) Diversity via MMR or per-doc caps (reduce duplicates)
4) Hybrid retrieval, FTS + vectors (catch exact keywords)
5) Reranking (sort candidates with a stronger model)
6) Query rewriting / multi-query (improve recall)

You don’t need all of them. Add the next rung only when you see a clear failure mode.


1) Top‑K dense retrieval

Baseline pgvector query:

SELECT id, doc_id, section_path, content
FROM chunks
ORDER BY embedding <=> $1::vector
LIMIT 8;

Good defaults:

- start with k=6–10
- keep chunks ~200–400 tokens (see: Chunking)
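
If you want the same baseline wired up end to end, here is a minimal Python sketch. It assumes psycopg 3 with the pgvector adapter, an OpenAI embedding model (text-embedding-3-small), and the chunks table from the query above; swap in your own stack.

import numpy as np
import psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI

client = OpenAI()

def retrieve_top_k(query: str, k: int = 8) -> list[tuple]:
    # Embed the query with the same model used to embed the chunks.
    emb = client.embeddings.create(model="text-embedding-3-small", input=query)
    qvec = np.array(emb.data[0].embedding)
    with psycopg.connect("dbname=rag") as conn:  # hypothetical DSN
        register_vector(conn)  # lets psycopg send numpy arrays as pgvector values
        return conn.execute(
            """
            SELECT id, doc_id, section_path, content
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (qvec, k),
        ).fetchall()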


2) Metadata filters (huge win)

Filters improve both quality and latency because you search a smaller space.

Examples:

-- Only retrieve from a specific product
SELECT id, content
FROM chunks
WHERE metadata @> '{"product":"enterprise"}'::jsonb
ORDER BY embedding <=> $1::vector
LIMIT 8;

-- Only retrieve from a subset of sources
SELECT id, content
FROM chunks
WHERE metadata->>'source_type' = 'docs'
ORDER BY embedding <=> $1::vector
LIMIT 8;

3) Diversity: per-doc caps and MMR

Per-document cap (simple and effective)

If your top results come from one long document, cap results per doc_id.

One SQL approach:

WITH ranked AS (
  SELECT
    id,
    doc_id,
    content,
    row_number() OVER (PARTITION BY doc_id ORDER BY embedding <=> $1::vector) AS doc_rnk,
    embedding <=> $1::vector AS distance
  FROM chunks
)
SELECT id, doc_id, content
FROM ranked
WHERE doc_rnk <= 2
ORDER BY distance
LIMIT 8;

MMR (Maximal Marginal Relevance)

MMR trades off:

- relevance to the query
- novelty vs. already-selected chunks

MMR is typically done in application code (not SQL). High-level pseudocode:

selected = []
while len(selected) < k:
  pick chunk that maximizes:
    lambda * sim(query, chunk) - (1-lambda) * max(sim(chunk, s) for s in selected)

Start with lambda=0.7.
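
A runnable version of that loop, as a minimal sketch: it assumes candidates arrive as (chunk_id, embedding) pairs with numpy embeddings, and uses cosine similarity. The function and parameter names are illustrative.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_vec: np.ndarray, candidates: list, k: int = 8, lam: float = 0.7) -> list:
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Penalty: similarity to the most similar already-selected chunk.
            redundancy = max((cosine(vec, s_vec) for _, s_vec in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return [chunk_id for chunk_id, _ in selected]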


4) Hybrid retrieval (dense + keyword)

Use hybrid retrieval when users type:

- exact names (“OAuth”, “SAML”, “HIPAA”)
- error codes (“E11000”, “403”)
- IDs (“INV-1029”)

Postgres makes hybrid retrieval easy. A strong default is RRF (Reciprocal Rank Fusion): run full-text search and vector search separately, then merge the two ranked lists.

See: SQL for RAG
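
If you'd rather fuse in application code than in SQL, here is a minimal RRF sketch over two ranked lists of chunk IDs (one from full-text search, one from vector search). The constant k=60 is the conventional default; rrf_fuse and its inputs are illustrative.

def rrf_fuse(keyword_ids: list, vector_ids: list, k: int = 60, top_n: int = 8) -> list:
    # Each list contributes 1 / (k + rank) per chunk; shared hits add up.
    scores: dict = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: fused_ids = rrf_fuse(fts_ids, ann_ids)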


5) Reranking

Vector similarity is a fast filter, not a perfect ranker.

A common pattern:

1) retrieve top 50–200 candidates (fast)
2) rerank to top 6–10 using a stronger model (cross-encoder or LLM)

Reranking helps when:

- the correct chunk is in the candidate set but not in the top‑K
- your content is repetitive and hard to distinguish

Two-stage retrieval pattern:

# Stage 1: fetch 100 candidates with ANN (fast).
# `retrieve` is your own dense-retrieval function (e.g. the pgvector query above).
candidates = retrieve(query, k=100)

# Stage 2: rerank to top 8 with Cohere (accurate)
import cohere
co = cohere.ClientV2()

results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[c["content"] for c in candidates],
    top_n=8,
)

# r.index points back into the original candidate list
reranked = [candidates[r.index] for r in results.results]

!!! tip "Full reranking tutorial"
    See Reranking Retrieved Results for local cross-encoder options, a latency comparison table, and how to measure impact using the eval harness.


6) Query rewriting / multi-query

If the user query is ambiguous or too short, rewrite it.

Two practical patterns:

- Rewrite: “Expand this query into a search query for our docs…”
- Multi-query: generate 3–5 variants, retrieve for each, then merge + dedupe (sketch below, after the rewrite function)

Only do this when you see recall issues; it increases latency and token usage.

Runnable rewrite function:

from openai import OpenAI

client = OpenAI()

def rewrite_query(original_query: str) -> str:
    """Expand an ambiguous query into a better standalone search query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a search query optimizer. "
                    "Rewrite the user's query into a clear, specific search query "
                    "for a technical documentation system. "
                    "Return only the rewritten query, nothing else."
                ),
            },
            {"role": "user", "content": original_query},
        ],
    )
    return resp.choices[0].message.content.strip()


# Example
original = "why is it slow?"
better = rewrite_query(original)
print(better)
# → "What are the common causes of slow retrieval performance in a RAG system?"
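
And a minimal multi-query sketch building on the same client: generate a few variants, retrieve for each, then merge and dedupe by chunk id. The prompt, variant count, and retrieve function are illustrative stand-ins for your own retriever.

def multi_query_retrieve(original_query: str, k_per_query: int = 8) -> list:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Generate 3 alternative search queries for this question, "
                "one per line, nothing else:\n" + original_query
            ),
        }],
    )
    variants = [original_query] + resp.choices[0].message.content.strip().splitlines()

    seen, merged = set(), []
    for q in variants:
        for chunk in retrieve(q, k=k_per_query):  # your dense/hybrid retriever
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged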

Debug checklist (“why does retrieval look wrong?”)

1) Inspect the top retrieved chunks (before prompting).
2) Check if parsing added boilerplate noise.
3) Check chunk size: too small/too big?
4) Add metadata filters (product/version/source).
5) Add per-doc caps or MMR if results are duplicates.
6) Add hybrid retrieval if exact keywords are missed.
7) Consider reranking if “almost right” results keep showing up.
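
For step 1, printing the top hits with their distances is often enough to spot the problem. A minimal sketch, reusing the psycopg/pgvector setup from the baseline section (conn and the query embedding qvec are assumed to be set up as shown there):

def inspect_retrieval(qvec, k: int = 8) -> None:
    # Eyeball ids, distances, and content previews before they reach the prompt.
    rows = conn.execute(
        """
        SELECT doc_id, section_path, left(content, 120),
               embedding <=> %s AS distance
        FROM chunks
        ORDER BY distance
        LIMIT %s
        """,
        (qvec, k),
    ).fetchall()
    for doc_id, section, preview, dist in rows:
        print(f"{dist:.3f}  {doc_id}  {section}\n    {preview}")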


Next steps