
Evaluating RAG (Retrieval + Answer Quality)

If you don’t evaluate RAG, you end up guessing:

- “Did chunking help?”
- “Is hybrid search better?”
- “Did the prompt change actually improve groundedness?”

Evaluation doesn’t need to be fancy. A small, honest test set plus a repeatable script is enough to drive big improvements.


What to evaluate

Retrieval quality (before the LLM)

Common signals:

- Recall@K: did you retrieve at least one “correct” chunk in the top K?
- MRR (mean reciprocal rank): how early does the first correct result appear?
- Diversity: are results all from the same doc?
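Recall@K is implemented in the harness below; MRR can be sketched the same way (a minimal version, assuming the retrieved sources are ordered best-first):

```python
def mrr(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Reciprocal rank of the first correct source (0.0 if none retrieved).

    Assumes retrieved_sources is ordered best-first.
    """
    expected = {s.lower() for s in expected_sources}
    for rank, source in enumerate(retrieved_sources, start=1):
        if source.lower() in expected:
            return 1.0 / rank
    return 0.0
```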

Answer quality (after the LLM)

Common signals:

- Groundedness / faithfulness: does the answer use only the provided context?
- Relevance: does it answer the question?
- Citation correctness: do citations point to supporting chunks?


Build a minimal test set (questions.jsonl)

Create a small file, e.g. questions.jsonl:

{"id":"pricing_sso","question":"Which plan supports SSO?","expected_sources":["pricing.md"],"notes":"Should mention Enterprise only."}
{"id":"reset_password","question":"How do I reset my password?","expected_sources":["account.md"]}

Tips:

- Start with 20–50 questions.
- Include both “easy” and “hard” questions.
- Add ambiguous questions you expect users to ask.
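Malformed JSONL lines fail silently until eval time, so a quick sanity check pays off. A minimal validator sketch (field names match the questions.jsonl example above):

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "question"}


def validate_questions(path: str) -> list[str]:
    """Return human-readable problems found in a questions.jsonl file."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for lineno, line in enumerate(
        Path(path).read_text(encoding="utf-8").splitlines(), start=1
    ):
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {lineno}: invalid JSON ({e})")
            continue
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append(f"line {lineno}: missing fields {sorted(missing)}")
        if item.get("id") in seen_ids:
            problems.append(f"line {lineno}: duplicate id {item['id']!r}")
        seen_ids.add(item.get("id", ""))
    return problems
```

An empty return value means the file is safe to feed to the harness.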

If you don’t know the exact chunk IDs yet, start by using expected_sources (file names / URLs) as the supervision signal.


A minimal retrieval eval harness

This script assumes you have a retrieve(question, k) function that returns chunks with a source field.

from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_jsonl(path: str) -> list[dict[str, Any]]:
    items: list[dict[str, Any]] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        items.append(json.loads(line))
    return items


def recall_at_k(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    expected = {s.lower() for s in expected_sources}
    got = {s.lower() for s in retrieved_sources}
    return 1.0 if expected.intersection(got) else 0.0


def evaluate_retrieval(questions_path: str, *, k: int = 8) -> list[float]:
    qs = load_jsonl(questions_path)
    scores: list[float] = []

    for q in qs:
        question = q["question"]
        expected_sources = q.get("expected_sources") or []

        # TODO: replace with your actual retrieval function
        retrieved = retrieve(question, k=k)  # noqa: F821

        retrieved_sources = [c["source"] for c in retrieved]
        if expected_sources:
            score = recall_at_k(retrieved_sources, expected_sources)
            scores.append(score)

        print("Q:", question)
        print("Retrieved sources:", retrieved_sources[:5])
        print("---")

    if scores:
        print(f"Recall@{k}: {sum(scores) / len(scores):.3f} ({len(scores)} labeled questions)")
    else:
        print("No labeled questions found (missing expected_sources).")
    # Return the per-question scores so they can be saved for run comparison.
    return scores
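The harness assumes a retrieve(question, k) function exists. A throwaway keyword-overlap stub (purely hypothetical, useful only for smoke-testing the harness before your real retriever is wired in) could look like:

```python
def retrieve(question: str, k: int = 8) -> list[dict]:
    """Hypothetical stub: rank a tiny in-memory corpus by naive keyword overlap.

    Replace with your real retriever; this exists only to smoke-test the harness.
    """
    corpus = [
        {"source": "pricing.md", "text": "Enterprise plan includes SSO."},
        {"source": "account.md", "text": "Reset your password from the account page."},
    ]
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]
```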

Save results for run comparison

Append each eval run to a JSONL file so you can compare runs over time:

import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(
    scores: list[float],
    *,
    k: int,
    output_path: str = "eval_results.jsonl",
    run_label: str = "",
) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_label": run_label,
        "k": k,
        "n": len(scores),
        "recall_at_k": round(sum(scores) / len(scores), 4) if scores else None,
    }
    with open(output_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    print(f"Saved → {output_path}: recall@{k}={record['recall_at_k']}")

Usage: collect the scores list from evaluate_retrieval and pass it to save_results; compare runs over time by diffing eval_results.jsonl.
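If raw diffing gets tedious, a small loader sketch over eval_results.jsonl makes runs easy to scan side by side (assuming the record fields written by save_results above):

```python
import json
from pathlib import Path


def load_runs(path: str = "eval_results.jsonl") -> list[dict]:
    """Load saved eval records, oldest first."""
    runs: list[dict] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            runs.append(json.loads(line))
    return runs


def print_runs(runs: list[dict]) -> None:
    """One line per run: timestamp, label, recall, sample size."""
    for r in runs:
        label = r.get("run_label") or "(unlabeled)"
        print(f"{r['timestamp']}  {label:<24}  recall@{r['k']}={r['recall_at_k']}  n={r['n']}")
```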


LLM-as-judge (optional, use carefully)

LLM judges can help you scale qualitative checks (groundedness, citations), but:

- they can be noisy
- they can be biased by your prompt

Use this function to get structured scores instead of copy-pasting a rubric:

import json
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM = """You are an evaluation judge for a RAG system.
Rate the answer on three dimensions, each 0-2:
  0 = fails, 1 = partially meets, 2 = fully meets

Return ONLY valid JSON in this format:
{"groundedness": 0-2, "relevance": 0-2, "citation_correctness": 0-2, "reason": "..."}

Definitions:
- groundedness: answer uses only the provided context (no invented facts)
- relevance: answer addresses the question
- citation_correctness: citations [source: ...] point to chunks that support the claim"""


def judge_answer(
    question: str,
    context: str,
    answer: str,
) -> dict:
    """Return groundedness/relevance/citation scores (0-2 each)."""
    user_content = (
        f"Question: {question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(resp.choices[0].message.content)


# Example
scores = judge_answer(
    question="Which plan supports SSO?",
    context="[chunk_id=1 source='pricing.md']\nEnterprise plan includes SSO.",
    answer="The Enterprise plan supports SSO. [source: pricing.md#chunk:1]",
)
print(scores)
# → {"groundedness": 2, "relevance": 2, "citation_correctness": 2, "reason": "..."}

Always keep a small set of human-reviewed examples as a reality check.
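One cheap reality check is raw agreement between the judge and your human labels on the same examples (a sketch; assumes both sides score on the same 0–2 scale, as in the rubric above):

```python
def judge_agreement(
    judge_scores: list[dict],
    human_scores: list[dict],
    dimension: str = "groundedness",
) -> float:
    """Fraction of examples where judge and human gave the same 0-2 score."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be aligned example-by-example")
    if not judge_scores:
        return 0.0
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if j[dimension] == h[dimension]
    )
    return matches / len(judge_scores)
```

If agreement is low, fix the judge prompt (or the rubric) before trusting the judge at scale.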


Run retrieval eval as a pytest test

Gate quality in CI with a single pytest assertion:

# tests/test_retrieval_eval.py
import pytest
from rag.retrieve import retrieve  # your retrieve() function


QUESTIONS_PATH = "tests/fixtures/questions.jsonl"
RECALL_THRESHOLD = 0.80


def test_recall_at_8():
    from eval import load_jsonl, recall_at_k  # the functions above

    qs = load_jsonl(QUESTIONS_PATH)
    labeled = [q for q in qs if q.get("expected_sources")]
    if not labeled:
        pytest.skip("no labeled questions in fixture (missing expected_sources)")

    scores = []
    for q in labeled:
        retrieved = retrieve(q["question"], k=8)
        sources = [c["source"] for c in retrieved]
        scores.append(recall_at_k(sources, q["expected_sources"]))

    recall = sum(scores) / len(scores)
    assert recall >= RECALL_THRESHOLD, (
        f"Recall@8 = {recall:.3f} < threshold {RECALL_THRESHOLD}"
    )

Run with:

uv run pytest tests/test_retrieval_eval.py -v

Iteration loop (the only thing that matters)

1) Change one thing (chunk size, overlap, filters, hybrid search, prompt)
2) Re-run the retrieval eval
3) Spot-check answers on a small subset
4) Keep changes that improve metrics and reduce obvious failures
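Step 4 can be made mechanical with a tiny regression gate over your saved results (a hypothetical helper; assumes a list of records in the shape written by save_results, oldest first):

```python
def recall_regressed(runs: list[dict], tolerance: float = 0.02) -> bool:
    """True if the latest run's recall dropped more than `tolerance` vs the previous.

    `runs` holds records as written by save_results, oldest first.
    """
    labeled = [r for r in runs if r.get("recall_at_k") is not None]
    if len(labeled) < 2:
        return False  # nothing to compare against yet
    prev, latest = labeled[-2]["recall_at_k"], labeled[-1]["recall_at_k"]
    return (prev - latest) > tolerance
```

Run it after each eval; a True result means revert the change (or investigate) before moving on.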


Next steps