# Prompt Engineering for RAG
Once retrieval works, the next common failure mode is that the model ignores your context or invents details.
Good RAG prompting comes down to three things:
- Grounding: “use only the provided context”
- Formatting: make context easy to read (chunk IDs, clear separators)
- Failure modes: when the answer isn’t present, say so
## Prompt roles: system vs user
- System prompt: stable rules and behavior (grounding, tone, citation requirements).
- User prompt: the variable input (the question + retrieved context).
Keep the system prompt short and strict. Put the retrieved context in the user message so it’s obviously “input data”.
## Canonical BuildRag prompt template
This is a strong default for most Q&A style RAG:
```text
You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context does not contain the answer, say: "I don't know based on the provided context."
- Do not use outside knowledge.
- When you use a fact from the context, cite it in this format: [source: {source}#chunk:{chunk_id}]
- If multiple chunks support the answer, cite multiple sources.
- Keep the answer concise and correct.
```
## Context formatting patterns
The model performs better when each chunk is:
- clearly separated
- labeled with chunk_id + source
- optionally labeled with section info (heading path)
Example context formatting:
```text
--- BEGIN CONTEXT ---

[chunk_id=128 source="docs/handbook.md" section="Pricing > Enterprise"]
...chunk text...

[chunk_id=245 source="https://example.com/docs/api" section="Auth"]
...chunk text...

--- END CONTEXT ---
```
## Reference code (OpenAI chat + citations)
```python
from __future__ import annotations

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context does not contain the answer, say: "I don't know based on the provided context."
- Do not use outside knowledge.
- When you use a fact from the context, cite it in this format: [source: {source}#chunk:{chunk_id}]
- Keep the answer concise and correct.
"""


def format_context(chunks: list[dict]) -> str:
    parts: list[str] = ["--- BEGIN CONTEXT ---"]
    for c in chunks:
        section = c.get("section_path") or ""
        section_part = f' section="{section}"' if section else ""
        parts.append(
            f'\n[chunk_id={c["id"]} source="{c["source"]}"{section_part}]\n{c["content"]}'
        )
    parts.append("\n--- END CONTEXT ---")
    return "\n".join(parts)


def answer_question(question: str, *, chunks: list[dict]) -> str:
    ctx = format_context(chunks)
    user_content = f"Question:\n{question}\n\nContext:\n{ctx}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content or ""
```
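Because the system prompt mandates a fixed citation format, you can check the model's output against it mechanically. A minimal sketch; the helper names `extract_citations` and `uncited_chunks` are illustrative, not part of any library:

```python
import re

# Matches citations in the prompt's required format: [source: {source}#chunk:{chunk_id}]
CITATION_RE = re.compile(r"\[source:\s*(?P<source>[^\]#]+)#chunk:(?P<chunk_id>\d+)\]")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Return the (source, chunk_id) pairs cited in a model answer."""
    return [
        (m.group("source").strip(), int(m.group("chunk_id")))
        for m in CITATION_RE.finditer(answer)
    ]


def uncited_chunks(answer: str, chunks: list[dict]) -> list[int]:
    """Chunk IDs that were provided but never cited (useful when debugging retrieval)."""
    cited = {chunk_id for _, chunk_id in extract_citations(answer)}
    return [c["id"] for c in chunks if c["id"] not in cited]
```

Answers with no extracted citations, or with many uncited chunks, are a cheap signal that either retrieval or the prompt needs attention.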
## Token/cost control knobs

These are the most impactful knobs to keep costs down without hurting quality:

- Cap retrieved chunks: `k=6–10`
- Cap per document: max 2–3 chunks per `doc_id`
- Dedupe near-identical chunks
- Truncate very long chunks (or re-chunk more aggressively)
If you still need more context, prefer improving retrieval (filters, hybrid search, reranking) over dumping more text.
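The knobs above can be applied in one pass over the retrieved chunks. A minimal sketch, assuming chunks arrive sorted by relevance and carry `id`, `source`, and `content` keys; the function name `cap_context` and the first-200-characters dedupe heuristic are illustrative choices, not a library API:

```python
from collections import defaultdict


def cap_context(
    chunks: list[dict], *, k: int = 8, per_doc: int = 2, max_chars: int = 2000
) -> list[dict]:
    """Apply the cost-control knobs: global cap, per-document cap,
    naive dedupe, and truncation. Assumes chunks are sorted by relevance."""
    seen_texts: set[str] = set()
    per_doc_count: defaultdict[str, int] = defaultdict(int)
    out: list[dict] = []
    for c in chunks:
        key = c["content"][:200]  # cheap near-duplicate signal
        if key in seen_texts:
            continue
        if per_doc_count[c["source"]] >= per_doc:
            continue
        seen_texts.add(key)
        per_doc_count[c["source"]] += 1
        out.append({**c, "content": c["content"][:max_chars]})
        if len(out) >= k:
            break
    return out
```

Because the input is relevance-ordered, the caps always drop the weakest chunks first.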
## Optional: frameworks (callouts)

**LangChain**

- Keep your system prompt as a constant `SystemMessage`
- Build a context formatter that attaches chunk IDs and sources

**LlamaIndex**

- Use node metadata (`node_id`, `source`) and enforce a citation format in your prompt template
## Multi-turn extension
For chat-style RAG, the messages list grows with each turn. The shape changes from a single exchange to a full conversation history:
```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    # Turn 1
    {"role": "user", "content": "Question:\nWhat is SSO?\n\nContext:\n...chunks..."},
    {"role": "assistant", "content": "SSO stands for Single Sign-On. [source: auth.md#chunk:3]"},
    # Turn 2: context is FRESHLY RETRIEVED for the rewritten query
    {"role": "user", "content": "Question:\nWhich plans include it?\n\nContext:\n...chunks..."},
]
```
Key rules:

- Do NOT append raw retrieved chunks to the history; they bloat the context fast.
- Do NOT append the system prompt again on each turn.
- Rewrite the follow-up query into a standalone query before retrieval so the search is self-contained.
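One way to follow the first rule is to strip the bulky context out of past user turns while keeping the questions and answers. A sketch, assuming the `Question:\n...\n\nContext:\n...` layout used above; `compact_history` is a hypothetical helper, not a library function:

```python
def compact_history(messages: list[dict]) -> list[dict]:
    """Strip retrieved context from all but the latest user turn, so the
    history keeps questions and answers but not the bulky chunk text."""
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    out: list[dict] = []
    for i, m in enumerate(messages):
        if m["role"] == "user" and i != last_user and "\n\nContext:\n" in m["content"]:
            question = m["content"].split("\n\nContext:\n", 1)[0]
            out.append({"role": "user", "content": question})
        else:
            out.append(m)
    return out
```

Run this on the messages list before each API call; only the newest user turn carries the freshly retrieved context.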
**Full multi-turn tutorial:** See Conversational RAG (Multi-Turn) for a complete implementation including query rewriting, conversation buffering, and session persistence.
## Next steps
- Improve retrieval quality (so the prompt sees better context): Retrieval Strategies
- Measure improvements objectively: Evaluating RAG