# Evaluating RAG (Retrieval + Answer Quality)

If you don’t evaluate RAG, you end up guessing:

- “Did chunking help?”
- “Is hybrid search better?”
- “Did the prompt change actually improve groundedness?”
Evaluation doesn’t need to be fancy. A small, honest test set plus a repeatable script is enough to drive big improvements.
## What to evaluate

### Retrieval quality (before the LLM)

Common signals:

- Recall@K: did you retrieve at least one “correct” chunk in the top K?
- MRR (mean reciprocal rank): how early does the first correct result appear?
- Diversity: are all results from the same doc?
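Recall@K is implemented in the harness below; MRR just needs a per-question reciprocal rank, averaged over the test set. A minimal sketch (the `reciprocal_rank` helper name is my own):

```python
def reciprocal_rank(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Reciprocal rank of the first correct result: 1/1, 1/2, ... (0.0 if absent)."""
    expected = {s.lower() for s in expected_sources}
    for rank, source in enumerate(retrieved_sources, start=1):
        if source.lower() in expected:
            return 1.0 / rank
    return 0.0
```

MRR over a test set is then the mean of the per-question reciprocal ranks.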
### Answer quality (after the LLM)

Common signals:

- Groundedness / faithfulness: does the answer use only the provided context?
- Relevance: does it answer the question?
- Citation correctness: do citations point to supporting chunks?
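A full citation-correctness check needs entailment (see the LLM judge below for that), but a cheap first pass is verifying that every cited source actually appears in the retrieved context. A sketch, assuming the `[source: ...]` citation format used elsewhere on this page (both helper names are my own):

```python
import re


def cited_sources(answer: str) -> list[str]:
    """Pull source names out of citations like '[source: pricing.md#chunk:1]'."""
    return [s.strip() for s in re.findall(r"\[source:\s*([^\]#]+)", answer)]


def citations_resolve(answer: str, context_sources: set[str]) -> bool:
    """Cheap check: every cited source must be one of the retrieved sources."""
    cited = {s.lower() for s in cited_sources(answer)}
    return cited <= {s.lower() for s in context_sources}
```

This catches hallucinated citations, not misattributed ones; a citation can resolve to a real chunk that doesn't support the claim.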
## Build a minimal test set (questions.jsonl)

Create a small file, e.g. `questions.jsonl`:

```json
{"id":"pricing_sso","question":"Which plan supports SSO?","expected_sources":["pricing.md"],"notes":"Should mention Enterprise only."}
{"id":"reset_password","question":"How do I reset my password?","expected_sources":["account.md"]}
```
Tips:

- Start with 20–50 questions.
- Include both “easy” and “hard” questions.
- Add ambiguous questions you expect users to ask.
If you don’t know the exact chunk IDs yet, start by using `expected_sources` (file names / URLs) as the supervision signal.
## A minimal retrieval eval harness

This script assumes you have a `retrieve(question, k)` function that returns chunks with a `source` field.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_jsonl(path: str) -> list[dict[str, Any]]:
    items: list[dict[str, Any]] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        items.append(json.loads(line))
    return items


def recall_at_k(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """1.0 if any expected source appears in the retrieved sources, else 0.0."""
    expected = {s.lower() for s in expected_sources}
    got = {s.lower() for s in retrieved_sources}
    return 1.0 if expected.intersection(got) else 0.0


def evaluate_retrieval(questions_path: str, *, k: int = 8) -> None:
    qs = load_jsonl(questions_path)
    scores: list[float] = []
    for q in qs:
        question = q["question"]
        expected_sources = q.get("expected_sources") or []

        # TODO: replace with your actual retrieval function
        retrieved = retrieve(question, k=k)  # noqa: F821
        retrieved_sources = [c["source"] for c in retrieved]

        if expected_sources:
            score = recall_at_k(retrieved_sources, expected_sources)
            scores.append(score)

        print("Q:", question)
        print("Retrieved sources:", retrieved_sources[:5])
        print("---")

    if scores:
        print(f"Recall@{k}: {sum(scores) / len(scores):.3f} ({len(scores)} labeled questions)")
    else:
        print("No labeled questions found (missing expected_sources).")
```
## Save results for run comparison

Append each eval run to a JSONL file so you can compare runs over time:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(
    scores: list[float],
    *,
    k: int,
    output_path: str = "eval_results.jsonl",
    run_label: str = "",
) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_label": run_label,
        "k": k,
        "n": len(scores),
        "recall_at_k": round(sum(scores) / len(scores), 4) if scores else None,
    }
    with open(output_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    print(f"Saved → {output_path}: recall@{k}={record['recall_at_k']}")
```

Usage: pass the `scores` list collected by `evaluate_retrieval`, then compare runs by diffing `eval_results.jsonl`.
## LLM-as-judge (optional, use carefully)

LLM judges can help you scale qualitative checks (groundedness, citations), but:

- they can be noisy
- they can be biased by your prompt

Use this function to get structured scores instead of copy-pasting a rubric:
```python
import json

from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM = """You are an evaluation judge for a RAG system.

Rate the answer on three dimensions, each 0-2:
0 = fails, 1 = partially meets, 2 = fully meets

Return ONLY valid JSON in this format:
{"groundedness": 0-2, "relevance": 0-2, "citation_correctness": 0-2, "reason": "..."}

Definitions:
- groundedness: answer uses only the provided context (no invented facts)
- relevance: answer addresses the question
- citation_correctness: citations [source: ...] point to chunks that support the claim"""


def judge_answer(
    question: str,
    context: str,
    answer: str,
) -> dict:
    """Return groundedness/relevance/citation scores (0-2 each)."""
    user_content = (
        f"Question: {question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(resp.choices[0].message.content)


# Example
scores = judge_answer(
    question="Which plan supports SSO?",
    context="[chunk_id=1 source='pricing.md']\nEnterprise plan includes SSO.",
    answer="The Enterprise plan supports SSO. [source: pricing.md#chunk:1]",
)
print(scores)
# → {"groundedness": 2, "relevance": 2, "citation_correctness": 2, "reason": "..."}
```
Always keep a small set of human-reviewed examples as a reality check.
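One way to keep that reality check honest is to measure how often the judge matches your human labels on those examples. A minimal sketch using exact-match agreement on the 0-2 scale (the `judge_agreement` name is my own; stricter statistics like Cohen's kappa exist if you need them):

```python
def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of examples where the LLM judge matches the human label exactly."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not judge_scores:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)
```

If agreement on a dimension is low, tune the judge prompt (or tighten the rubric definitions) before trusting that metric in the iteration loop.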
## Run retrieval eval as a pytest test

Gate quality in CI with a single pytest assertion:

```python
# tests/test_retrieval_eval.py
from rag.retrieve import retrieve  # your retrieve() function

QUESTIONS_PATH = "tests/fixtures/questions.jsonl"
RECALL_THRESHOLD = 0.80


def test_recall_at_8():
    from eval import load_jsonl, recall_at_k  # the functions above

    qs = load_jsonl(QUESTIONS_PATH)
    labeled = [q for q in qs if q.get("expected_sources")]
    scores = []
    for q in labeled:
        retrieved = retrieve(q["question"], k=8)
        sources = [c["source"] for c in retrieved]
        scores.append(recall_at_k(sources, q["expected_sources"]))

    recall = sum(scores) / len(scores)
    assert recall >= RECALL_THRESHOLD, (
        f"Recall@8 = {recall:.3f} < threshold {RECALL_THRESHOLD}"
    )
```
Run with:

```shell
uv run pytest tests/test_retrieval_eval.py -v
```
## Iteration loop (the only thing that matters)

1. Change one thing (chunk size, overlap, filters, hybrid search, prompt).
2. Re-run the retrieval eval.
3. Spot-check answers on a small subset.
4. Keep changes that improve metrics and reduce obvious failures.
## Next steps

- Improve retrieval knobs systematically: Retrieval Strategies
- Use Postgres hybrid retrieval: SQL for RAG