
Prompt Engineering for RAG

Once retrieval works, the next common failure mode is that the model ignores your context or invents details.

Good RAG prompting comes down to three things:

  1. Grounding: “use only the provided context”
  2. Formatting: make context easy to read (chunk IDs, clear separators)
  3. Failure modes: when the answer isn’t present, say so

Prompt roles: system vs user

  • System prompt: stable rules and behavior (grounding, tone, citation requirements).
  • User prompt: the variable input (the question + retrieved context).

Keep the system prompt short and strict. Put the retrieved context in the user message so it’s obviously “input data”.


Canonical BuildRag prompt template

This is a strong default for most Q&A-style RAG:

You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context does not contain the answer, say: "I don't know based on the provided context."
- Do not use outside knowledge.
- When you use a fact from the context, cite it in this format: [source: {source}#chunk:{chunk_id}]
- If multiple chunks support the answer, cite multiple sources.
- Keep the answer concise and correct.
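Because the template pins down an exact citation format, you can sanity-check model answers programmatically. A minimal sketch (the regex and function name are my own, not part of the template):

```python
import re

# Matches the citation format the prompt asks for: [source: {source}#chunk:{chunk_id}]
CITATION_RE = re.compile(r"\[source: ([^\]#]+)#chunk:(\d+)\]")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Return (source, chunk_id) pairs cited in a model answer."""
    return [(src, int(cid)) for src, cid in CITATION_RE.findall(answer)]


print(extract_citations("SSO stands for Single Sign-On. [source: auth.md#chunk:3]"))
# [('auth.md', 3)]
```

A check like this is useful in evals: an answer with zero citations (and no "I don't know" refusal) is often a sign the model drifted off the context.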

Context formatting patterns

The model performs better when each chunk is:

  • clearly separated
  • labeled with chunk_id + source
  • optionally labeled with section info (heading path)

Example context formatting:

--- BEGIN CONTEXT ---

[chunk_id=128 source="docs/handbook.md" section="Pricing > Enterprise"]
...chunk text...

[chunk_id=245 source="https://example.com/docs/api" section="Auth"]
...chunk text...

--- END CONTEXT ---

Reference code (OpenAI chat + citations)

from __future__ import annotations

from openai import OpenAI


client = OpenAI()

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY the provided context.

Rules:
- If the context does not contain the answer, say: "I don't know based on the provided context."
- Do not use outside knowledge.
- When you use a fact from the context, cite it in this format: [source: {source}#chunk:{chunk_id}]
- Keep the answer concise and correct.
"""


def format_context(chunks: list[dict]) -> str:
    parts: list[str] = ["--- BEGIN CONTEXT ---"]
    for c in chunks:
        section = c.get("section_path") or ""
        section_part = f' section="{section}"' if section else ""
        parts.append(
            f'\n[chunk_id={c["id"]} source="{c["source"]}"{section_part}]\n{c["content"]}'
        )
    parts.append("\n--- END CONTEXT ---")
    return "\n".join(parts)


def answer_question(question: str, *, chunks: list[dict]) -> str:
    ctx = format_context(chunks)
    user_content = f"Question:\n{question}\n\nContext:\n{ctx}"

    # temperature=0 keeps answers deterministic and discourages embellishment.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content or ""

Token/cost control knobs

These are the most impactful knobs to keep costs down without hurting quality:

  • cap retrieved chunks: k=6–10
  • cap per document: max 2–3 chunks per doc_id
  • dedupe near-identical chunks
  • truncate very long chunks (or re-chunk more aggressively)

If you still need more context, prefer improving retrieval (filters, hybrid search, reranking) over dumping more text.
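The knobs above can be sketched as a pure post-retrieval filter. The function and field names (select_chunks, doc_id) are illustrative, not from a specific library, and the dedupe check here is exact-match only:

```python
def select_chunks(
    chunks: list[dict],
    *,
    k: int = 8,
    max_per_doc: int = 2,
    max_chars: int = 2000,
) -> list[dict]:
    """Cap, dedupe, and truncate retrieved chunks before prompting.

    Assumes chunks are already sorted by relevance and carry
    'doc_id' and 'content' keys (illustrative field names).
    """
    seen_texts: set[str] = set()
    per_doc: dict[str, int] = {}
    selected: list[dict] = []
    for c in chunks:
        text = c["content"].strip()
        # Dedupe near-identical chunks (exact match after whitespace strip;
        # swap in a fuzzier similarity check if your corpus needs it).
        if text in seen_texts:
            continue
        # Cap chunks per document.
        if per_doc.get(c["doc_id"], 0) >= max_per_doc:
            continue
        seen_texts.add(text)
        per_doc[c["doc_id"]] = per_doc.get(c["doc_id"], 0) + 1
        # Truncate very long chunks.
        selected.append({**c, "content": text[:max_chars]})
        # Cap the total number of chunks.
        if len(selected) == k:
            break
    return selected
```

Because the filter runs on already-ranked results, it never changes ordering; it only drops or shortens entries, so the best-ranked chunk per document always survives.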


Optional: frameworks (callouts)

LangChain

  • Keep your system prompt as a constant SystemMessage
  • Build a context formatter that attaches chunk IDs and sources

LlamaIndex

  • Use node metadata (node_id, source) and enforce a citation format in your prompt template


Multi-turn extension

For chat-style RAG, the messages list grows with each turn. The shape changes from a single exchange to a full conversation history:

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    # Turn 1
    {"role": "user",      "content": "Question:\nWhat is SSO?\n\nContext:\n...chunks..."},
    {"role": "assistant", "content": "SSO stands for Single Sign-On. [source: auth.md#chunk:3]"},
    # Turn 2 — context is FRESHLY RETRIEVED for the rewritten query
    {"role": "user",      "content": "Question:\nWhich plans include it?\n\nContext:\n...chunks..."},
]

Key rules:

  • Do NOT append raw retrieved chunks to the history — they bloat the context fast.
  • Do NOT append the system prompt again on each turn.
  • Rewrite the follow-up query into a standalone query before retrieval so the search is self-contained.
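One way to honor the first rule is to strip the Context section out of earlier user turns before appending the new one, so only the latest turn carries retrieved chunks. A sketch, assuming the "Question:\n...\n\nContext:\n..." message layout used above (the helper names are mine):

```python
def strip_context(user_content: str) -> str:
    """Drop the retrieved-context section from a past user turn,
    keeping only the question. Assumes the 'Question:...\\n\\nContext:...' layout."""
    head, sep, _ = user_content.partition("\n\nContext:\n")
    return head if sep else user_content


def append_turn(messages: list[dict], question: str, fresh_context: str) -> list[dict]:
    """Return a new messages list with old contexts trimmed and a new user turn
    carrying freshly retrieved context for the (already rewritten) query."""
    trimmed = [
        {**m, "content": strip_context(m["content"])} if m["role"] == "user" else m
        for m in messages
    ]
    trimmed.append(
        {"role": "user", "content": f"Question:\n{question}\n\nContext:\n{fresh_context}"}
    )
    return trimmed
```

This keeps the questions and answers in the history (so the model has conversational grounding) while the token-heavy chunks appear only once, in the current turn.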

Full multi-turn tutorial

See Conversational RAG (Multi-Turn) for a complete implementation including query rewriting, conversation buffering, and session persistence.


Next steps