
Understanding Embeddings (for RAG)

Embeddings turn text into a list of numbers (a vector). Similar texts produce vectors that are “close” in that vector space.

In RAG, you typically embed:

  • chunks of your documents (during ingestion)
  • the user query (at runtime)

Then you search for the most similar chunk vectors and send those chunks to the LLM.


What embeddings are (and aren’t)

Embeddings are great for:

  • synonyms and paraphrases (“cost” vs “pricing”)
  • fuzzy concept matching (“SSO” vs “single sign-on”)
  • finding relevant passages without exact keyword overlap

Embeddings are not great for:

  • exact numeric computations
  • strict filters and aggregations (use SQL for that)
  • retrieving new facts that aren’t in your data


Similarity: cosine vs dot product

The two most common similarity measures:

  • Cosine similarity: compares the angle between vectors (direction)
  • Dot product: depends on both direction and magnitude

Most RAG systems use cosine similarity or dot product. What matters most is consistency: use the same measure when you index/search in your vector store and when you interpret “closest” results. Note that for unit-length (normalized) vectors, cosine similarity and dot product produce identical rankings.
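To make the difference concrete, here is a small pure-Python sketch (no dependencies): two vectors pointing in the same direction have cosine similarity 1.0 regardless of magnitude, while their dot product grows with vector length.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Angle-only comparison: magnitude cancels out
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def dot(a: list[float], b: list[float]) -> float:
    # Sensitive to both direction and magnitude
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(cosine(a, b))  # 1.0 — zero angle, magnitude ignored
print(dot(a, b))     # 10.0 — scales with vector length
```

Doubling a vector leaves its cosine similarity unchanged but doubles its dot product, which is why the two measures can rank results differently when vectors are not normalized.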


Model selection (OpenAI-first)

A practical default:

  • text-embedding-3-small for most RAG use cases (cost-effective)
  • text-embedding-3-large when you need higher quality and can pay more

Open-source alternatives

If you need local embeddings, popular families include bge and e5. The trade-off is more infra and tuning, but you can avoid sending data to a hosted API.


Production embedding tips

  • Batch inputs to reduce overhead.
  • Retry on transient failures (rate limits / timeouts).
  • Cache embeddings keyed by (model, normalized_text_hash) so you don’t re-embed unchanged content.
  • Version embeddings: store embedding_model alongside each chunk so you can re-embed safely later.
  • If you change models, update your pgvector column dimension to match (e.g. vector(1536) for text-embedding-3-small, vector(3072) for text-embedding-3-large).
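The batching and retry tips above can be sketched as a small generic wrapper. This is not OpenAI-specific: the embed_batch callable and the decision to retry on any exception are assumptions — in production you would pass your real API call and narrow the exception types to rate-limit and timeout errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], *, max_attempts: int = 4, base_delay: float = 0.5) -> T:
    """Call fn, retrying on exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")


def embed_in_batches(texts: list[str], embed_batch, *, batch_size: int = 64) -> list[list[float]]:
    """Embed texts in batches; embed_batch is your API call (an assumption here)."""
    out: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # Retry each batch independently so one flaky call doesn't redo everything
        out.extend(with_retries(lambda: embed_batch(batch)))
    return out
```

Batching each API call and retrying per batch means a transient failure only re-sends one batch, not the whole corpus.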

Hash-based embedding cache (inline snippet):

import hashlib, json
from pathlib import Path

CACHE_DIR = Path(".cache/embeddings")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def _cache_key(model: str, text: str) -> str:
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

def embed_with_cache(text: str, *, model: str = "text-embedding-3-small") -> list[float]:
    key = _cache_key(model, text)
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    # embed_texts (defined in the code section below) calls the embeddings API
    embedding = embed_texts([text], model=model)[0]
    path.write_text(json.dumps(embedding))
    return embedding

The cache is content-addressed: the same (model, text) pair always maps to the same file, so it is safe to reuse across runs.
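One detail worth spelling out: the production tips key the cache on a normalized text hash, while the snippet above hashes raw text. If you hash raw text, trivially different copies of the same chunk (trailing whitespace, different line endings) get different keys and are re-embedded. A minimal normalization sketch, assuming whitespace-only differences should share one cache entry:

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse whitespace runs and strip ends; adjust to your content
    return " ".join(text.split())


def cache_key(model: str, text: str) -> str:
    # Hash the normalized text so whitespace variants share a cache entry
    return hashlib.sha256(f"{model}:{normalize(text)}".encode()).hexdigest()


k1 = cache_key("text-embedding-3-small", "How do I reset  my password?\n")
k2 = cache_key("text-embedding-3-small", "How do I reset my password?")
print(k1 == k2)  # True — whitespace variants hit the same cache file
```

How aggressive the normalization should be (lowercasing, Unicode normalization) depends on whether those differences change meaning for your documents.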

Full caching tutorial

See Caching for RAG for a Postgres-based cache that works across multiple processes.


Code: generate embeddings + cosine similarity

Install

uv pip install openai

Embed texts

from __future__ import annotations

import math
from openai import OpenAI


client = OpenAI()


def embed_texts(texts: list[str], *, model: str = "text-embedding-3-small") -> list[list[float]]:
    resp = client.embeddings.create(model=model, input=texts)
    # Sort by index so output order is guaranteed to match input order
    return [item.embedding for item in sorted(resp.data, key=lambda d: d.index)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


texts = [
    "How do I reset my password?",
    "How can I change my account password?",
    "How do I deploy the service to production?",
]

embs = embed_texts(texts)
print("similarity(0,1) =", cosine_similarity(embs[0], embs[1]))
print("similarity(0,2) =", cosine_similarity(embs[0], embs[2]))

You should see that (0,1) is more similar than (0,2).
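From here, retrieval is just: embed the query, rank chunk vectors by similarity, and take the top k. Here is a self-contained sketch with toy 2-D vectors so it runs without an API key — in practice the vectors would come from embed_texts and live in a vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """chunks: list of (text, vector). Returns the k texts most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors standing in for real embeddings
chunks = [
    ("reset password",  [0.9, 0.1]),
    ("deploy service",  [0.1, 0.9]),
    ("change password", [0.8, 0.2]),
]
query = [1.0, 0.0]

print(top_k(query, chunks, k=2))  # ['reset password', 'change password']
```

The two password chunks outrank the deployment chunk because their toy vectors point closer to the query's direction — the same mechanism a vector store applies at scale with an index instead of a full sort.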


Next steps