
Prompt Engineering for RAG

Once retrieval works, the next common failure mode is that the model ignores your context or invents details.

Good RAG prompting comes down to three things:

  1. Grounding: “use only the provided context”
  2. Formatting: make context easy to read (chunk IDs, clear separators)
  3. Failure modes: when the answer isn’t present, say so

Prompt roles: system vs user

  • System prompt: stable rules and behavior (grounding, tone, citation requirements).
  • User prompt: the variable input (the question + retrieved context).

Keep the system prompt short and strict. Put the retrieved context in the user message so it’s obviously “input data”.


Canonical BuildRag prompt template

This is a strong default for most Q&A-style RAG:

You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context does not contain the answer, say: "I don't know based on the provided context."
- Do not use outside knowledge.
- When you use a fact from the context, cite it in this format: [source: {source}#chunk:{chunk_id}]
- If multiple chunks support the answer, cite multiple sources.
- Keep the answer concise and correct.
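Because the template pins down an exact citation format, you can sanity-check model answers programmatically. A minimal sketch (the regex and function name are my own, not part of the template):

```python
import re

# Matches the citation format the prompt asks for: [source: {source}#chunk:{chunk_id}]
CITATION_RE = re.compile(r"\[source: ([^\]#]+)#chunk:(\d+)\]")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Return (source, chunk_id) pairs cited in a model answer."""
    return [(src, int(cid)) for src, cid in CITATION_RE.findall(answer)]


print(extract_citations("SSO stands for Single Sign-On. [source: auth.md#chunk:3]"))
# [('auth.md', 3)]
```

A check like this is useful in evals: an answer with zero citations (and no "I don't know" refusal) is often a sign the model drifted off the context.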

Context formatting patterns

The model performs better when each chunk is:

  • clearly separated
  • labeled with chunk_id + source
  • optionally labeled with section info (heading path)

Example context formatting:

--- BEGIN CONTEXT ---

[chunk_id=128 source="docs/handbook.md" section="Pricing > Enterprise"]
...chunk text...

[chunk_id=245 source="https://example.com/docs/api" section="Auth"]
...chunk text...

--- END CONTEXT ---

Reference code (OpenAI chat + citations)

from __future__ import annotations

from openai import OpenAI


client = OpenAI()

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context does not contain the answer, say: "I don't know based on the provided context."
- Do not use outside knowledge.
- When you use a fact from the context, cite it in this format: [source: {source}#chunk:{chunk_id}]
- Keep the answer concise and correct.
"""


def format_context(chunks: list[dict]) -> str:
    parts: list[str] = ["--- BEGIN CONTEXT ---"]
    for c in chunks:
        section = c.get("section_path") or ""
        section_part = f' section="{section}"' if section else ""
        parts.append(
            f'\n[chunk_id={c["id"]} source="{c["source"]}"{section_part}]\n{c["content"]}'
        )
    parts.append("\n--- END CONTEXT ---")
    return "\n".join(parts)


def answer_question(question: str, *, chunks: list[dict]) -> str:
    ctx = format_context(chunks)
    user_content = f"Question:\n{question}\n\nContext:\n{ctx}"

    # temperature=0 keeps answers deterministic and discourages embellishment.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content or ""

Token/cost control knobs

These are the most impactful knobs to keep costs down without hurting quality:

  • cap retrieved chunks: k=6–10
  • cap per document: max 2–3 chunks per doc_id
  • dedupe near-identical chunks
  • truncate very long chunks (or re-chunk more aggressively)

If you still need more context, prefer improving retrieval (filters, hybrid search, reranking) over dumping more text.
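The knobs above can be sketched as a pure post-retrieval filter. The function and field names (select_chunks, doc_id) are illustrative, not from a specific library, and the dedupe check here is exact-match only:

```python
def select_chunks(
    chunks: list[dict],
    *,
    k: int = 8,
    max_per_doc: int = 2,
    max_chars: int = 2000,
) -> list[dict]:
    """Cap, dedupe, and truncate retrieved chunks before prompting.

    Assumes chunks are already sorted by relevance and carry
    'doc_id' and 'content' keys (illustrative field names).
    """
    seen_texts: set[str] = set()
    per_doc: dict[str, int] = {}
    selected: list[dict] = []
    for c in chunks:
        text = c["content"].strip()
        # Dedupe near-identical chunks (exact match after whitespace strip;
        # swap in a fuzzier similarity check if your corpus needs it).
        if text in seen_texts:
            continue
        # Cap chunks per document.
        if per_doc.get(c["doc_id"], 0) >= max_per_doc:
            continue
        seen_texts.add(text)
        per_doc[c["doc_id"]] = per_doc.get(c["doc_id"], 0) + 1
        # Truncate very long chunks.
        selected.append({**c, "content": text[:max_chars]})
        # Cap the total number of chunks.
        if len(selected) == k:
            break
    return selected
```

Because the filter runs on already-ranked results, it never changes ordering; it only drops or shortens entries, so the best-ranked chunk per document always survives.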


Optional: frameworks (callouts)

LangChain

  • Keep your system prompt as a constant SystemMessage
  • Build a context formatter that attaches chunk IDs and sources

LlamaIndex

  • Use node metadata (node_id, source) and enforce a citation format in your prompt template


Multi-turn extension

For chat-style RAG, the messages list grows with each turn. The shape changes from a single exchange to a full conversation history:

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    # Turn 1
    {"role": "user",      "content": "Question:\nWhat is SSO?\n\nContext:\n...chunks..."},
    {"role": "assistant", "content": "SSO stands for Single Sign-On. [source: auth.md#chunk:3]"},
    # Turn 2 — context is FRESHLY RETRIEVED for the rewritten query
    {"role": "user",      "content": "Question:\nWhich plans include it?\n\nContext:\n...chunks..."},
]

Key rules:

  • Do NOT append raw retrieved chunks to the history — they bloat the context fast.
  • Do NOT append the system prompt again on each turn.
  • Rewrite the follow-up query into a standalone query before retrieval so the search is self-contained.
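One way to honor the first rule is to strip the Context section out of earlier user turns before appending the new one, so only the latest turn carries retrieved chunks. A sketch, assuming the "Question:\n...\n\nContext:\n..." message layout used above (the helper names are mine):

```python
def strip_context(user_content: str) -> str:
    """Drop the retrieved-context section from a past user turn,
    keeping only the question. Assumes the 'Question:...\\n\\nContext:...' layout."""
    head, sep, _ = user_content.partition("\n\nContext:\n")
    return head if sep else user_content


def append_turn(messages: list[dict], question: str, fresh_context: str) -> list[dict]:
    """Return a new messages list with old contexts trimmed and a new user turn
    carrying freshly retrieved context for the (already rewritten) query."""
    trimmed = [
        {**m, "content": strip_context(m["content"])} if m["role"] == "user" else m
        for m in messages
    ]
    trimmed.append(
        {"role": "user", "content": f"Question:\n{question}\n\nContext:\n{fresh_context}"}
    )
    return trimmed
```

This keeps the questions and answers in the history (so the model has conversational grounding) while the token-heavy chunks appear only once, in the current turn.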

Full multi-turn tutorial

See Conversational RAG (Multi-Turn) for a complete implementation including query rewriting, conversation buffering, and session persistence.


Next steps