
Understanding Embeddings (for RAG)

Embeddings turn text into a list of numbers (a vector). Similar texts produce vectors that are “close” in that vector space.

In RAG, you typically embed:

  • chunks of your documents (during ingestion)
  • the user query (at runtime)

Then you search for the most similar chunk vectors and send those chunks to the LLM.


What embeddings are (and aren’t)

Embeddings are great for:

  • synonyms and paraphrases (“cost” vs “pricing”)
  • fuzzy concept matching (“SSO” vs “single sign-on”)
  • finding relevant passages without exact keyword overlap

Embeddings are not great for:

  • exact numeric computations
  • strict filters and aggregations (use SQL for that)
  • retrieving new facts that aren’t in your data


Similarity: cosine vs dot product

The two most common similarity measures:

  • Cosine similarity: compares the angle between vectors (direction)
  • Dot product: depends on both direction and magnitude

Most RAG systems use cosine similarity or dot product. What matters most is consistency: use the same measure when you index/search in your vector store and when you interpret “closest” results. Note that for unit-length (normalized) vectors, cosine similarity and dot product produce identical rankings.
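To make the difference concrete, here is a small pure-Python sketch (no dependencies): two vectors pointing in the same direction have cosine similarity 1.0 regardless of magnitude, while their dot product grows with vector length.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Angle-only comparison: magnitude cancels out
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def dot(a: list[float], b: list[float]) -> float:
    # Sensitive to both direction and magnitude
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(cosine(a, b))  # 1.0 — zero angle, magnitude ignored
print(dot(a, b))     # 10.0 — scales with vector length
```

Doubling a vector leaves its cosine similarity unchanged but doubles its dot product, which is why the two measures can rank results differently when vectors are not normalized.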


Model selection (OpenAI-first)

A practical default:

  • text-embedding-3-small for most RAG use cases (cost-effective)
  • text-embedding-3-large when you need higher quality and can pay more

Open-source alternatives

If you need local embeddings, popular families include bge and e5. The trade-off is more infra and tuning, but you can avoid sending data to a hosted API.


Production embedding tips

  • Batch inputs to reduce overhead.
  • Retry on transient failures (rate limits / timeouts).
  • Cache embeddings keyed by (model, normalized_text_hash) so you don’t re-embed unchanged content.
  • Version embeddings: store embedding_model alongside each chunk so you can re-embed safely later.
  • If you change models, update your pgvector column dimension to match (e.g. vector(1536) for text-embedding-3-small, vector(3072) for text-embedding-3-large).
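The batching and retry tips above can be sketched as a small generic wrapper. This is not OpenAI-specific: the embed_batch callable and the decision to retry on any exception are assumptions — in production you would pass your real API call and narrow the exception types to rate-limit and timeout errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], *, max_attempts: int = 4, base_delay: float = 0.5) -> T:
    """Call fn, retrying on exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")


def embed_in_batches(texts: list[str], embed_batch, *, batch_size: int = 64) -> list[list[float]]:
    """Embed texts in batches; embed_batch is your API call (an assumption here)."""
    out: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # Retry each batch independently so one flaky call doesn't redo everything
        out.extend(with_retries(lambda: embed_batch(batch)))
    return out
```

Batching each API call and retrying per batch means a transient failure only re-sends one batch, not the whole corpus.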

Hash-based embedding cache (inline snippet):

import hashlib, json
from pathlib import Path

CACHE_DIR = Path(".cache/embeddings")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def _cache_key(model: str, text: str) -> str:
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()

def embed_with_cache(text: str, *, model: str = "text-embedding-3-small") -> list[float]:
    key = _cache_key(model, text)
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    # embed_texts (defined in the code section below) calls the embeddings API
    embedding = embed_texts([text], model=model)[0]
    path.write_text(json.dumps(embedding))
    return embedding

The cache is content-addressed: the same (model, text) pair always maps to the same file, so it is safe to reuse across runs.
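One detail worth spelling out: the production tips key the cache on a normalized text hash, while the snippet above hashes raw text. If you hash raw text, trivially different copies of the same chunk (trailing whitespace, different line endings) get different keys and are re-embedded. A minimal normalization sketch, assuming whitespace-only differences should share one cache entry:

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse whitespace runs and strip ends; adjust to your content
    return " ".join(text.split())


def cache_key(model: str, text: str) -> str:
    # Hash the normalized text so whitespace variants share a cache entry
    return hashlib.sha256(f"{model}:{normalize(text)}".encode()).hexdigest()


k1 = cache_key("text-embedding-3-small", "How do I reset  my password?\n")
k2 = cache_key("text-embedding-3-small", "How do I reset my password?")
print(k1 == k2)  # True — whitespace variants hit the same cache file
```

How aggressive the normalization should be (lowercasing, Unicode normalization) depends on whether those differences change meaning for your documents.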

Full caching tutorial

See Caching for RAG for a Postgres-based cache that works across multiple processes.


Code: generate embeddings + cosine similarity

Install

uv pip install openai

Embed texts

from __future__ import annotations

import math
from openai import OpenAI


client = OpenAI()


def embed_texts(texts: list[str], *, model: str = "text-embedding-3-small") -> list[list[float]]:
    resp = client.embeddings.create(model=model, input=texts)
    # Sort by index so output order is guaranteed to match input order
    return [item.embedding for item in sorted(resp.data, key=lambda d: d.index)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


texts = [
    "How do I reset my password?",
    "How can I change my account password?",
    "How do I deploy the service to production?",
]

embs = embed_texts(texts)
print("similarity(0,1) =", cosine_similarity(embs[0], embs[1]))
print("similarity(0,2) =", cosine_similarity(embs[0], embs[2]))

You should see that (0,1) is more similar than (0,2).
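From here, retrieval is just: embed the query, rank chunk vectors by similarity, and take the top k. Here is a self-contained sketch with toy 2-D vectors so it runs without an API key — in practice the vectors would come from embed_texts and live in a vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """chunks: list of (text, vector). Returns the k texts most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors standing in for real embeddings
chunks = [
    ("reset password",  [0.9, 0.1]),
    ("deploy service",  [0.1, 0.9]),
    ("change password", [0.8, 0.2]),
]
query = [1.0, 0.0]

print(top_k(query, chunks, k=2))  # ['reset password', 'change password']
```

The two password chunks outrank the deployment chunk because their toy vectors point closer to the query's direction — the same mechanism a vector store applies at scale with an index instead of a full sort.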


Next steps