Understanding Embeddings (for RAG)¶
Embeddings turn text into a list of numbers (a vector). Similar texts produce vectors that are “close” in that vector space.
In RAG, you typically embed:

- chunks of your documents (during ingestion)
- the user query (at runtime)
Then you search for the most similar chunk vectors and send those chunks to the LLM.
What embeddings are (and aren’t)¶
Embeddings are great for:

- synonyms and paraphrases (“cost” vs “pricing”)
- fuzzy concept matching (“SSO” vs “single sign-on”)
- finding relevant passages without exact keyword overlap
Embeddings are not great for:

- exact numeric computations
- strict filters and aggregations (use SQL for that)
- retrieving new facts that aren’t in your data
Similarity: cosine vs dot product¶
The two most common similarity measures:
- Cosine similarity: compares the angle between vectors (direction)
- Dot product: depends on both direction and magnitude
Most RAG systems use cosine similarity or dot product. What matters most is consistency between:

- how you index/search in your vector store
- how you interpret “closest” results
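The relationship between the two is easy to check directly: once vectors are L2-normalized (scaled to unit length), dot product and cosine similarity give the same value. A minimal pure-Python sketch:

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    # Scale to unit length so magnitude no longer affects similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
# Raw dot product differs from cosine because the magnitudes differ...
print(dot(a, b), cosine(a, b))
# ...but on normalized vectors the two measures agree.
print(dot(normalize(a), normalize(b)))
```

This is why some vector stores normalize embeddings at index time: it lets them use the cheaper dot product while still ranking by angle.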
Model selection (OpenAI-first)¶
A practical default:
- text-embedding-3-small for most RAG use cases (cost-effective)
- text-embedding-3-large when you need higher quality and can pay more
Open-source alternatives
If you need local embeddings, popular families include bge and e5. The trade-off is more infra and tuning, but you can avoid sending data to a hosted API.
Production embedding tips¶
- Batch inputs to reduce overhead.
- Retry on transient failures (rate limits / timeouts).
- Cache embeddings keyed by `(model, normalized_text_hash)` so you don’t re-embed unchanged content.
- Version embeddings: store `embedding_model` alongside each chunk so you can re-embed safely later.
- If you change models, update your pgvector schema dimension (e.g. `vector(1536)`).
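The batching-and-retry pattern can be sketched in a few lines of plain Python. The helper names here (`with_retries`, `batched`) are illustrative, not from any SDK:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], *, max_attempts: int = 5, base_delay: float = 0.5) -> T:
    """Call fn, retrying with exponential backoff and jitter on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error.
            # Exponential backoff: 0.5s, 1s, 2s, ... plus a little jitter.
            time.sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")


def batched(items: list[str], size: int) -> list[list[str]]:
    # Split inputs into fixed-size batches to cut per-request overhead.
    return [items[i : i + size] for i in range(0, len(items), size)]
```

In practice you would wrap each embedding call, e.g. `with_retries(lambda: embed_texts(batch))` for each batch from `batched(chunks, 100)`. In production you may want to catch only transient error types (rate limits, timeouts) rather than any exception.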
Hash-based embedding cache (inline snippet):
```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache/embeddings")
CACHE_DIR.mkdir(parents=True, exist_ok=True)


def _cache_key(model: str, text: str) -> str:
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()


def embed_with_cache(text: str, *, model: str = "text-embedding-3-small") -> list[float]:
    key = _cache_key(model, text)
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    # Cache miss: embed via the API (embed_texts is defined below) and persist.
    embedding = embed_texts([text], model=model)[0]
    path.write_text(json.dumps(embedding))
    return embedding
```
The cache is content-addressed: the same (model, text) always hits the same file. Safe to use across runs.
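To see the content-addressed property concretely: the key function from the snippet is deterministic, and changing either the model or the text produces a different key (so a model upgrade never serves stale vectors):

```python
import hashlib


def _cache_key(model: str, text: str) -> str:
    # Same (model, text) pair -> same SHA-256 key, so cache hits survive restarts.
    return hashlib.sha256(f"{model}:{text}".encode()).hexdigest()


k1 = _cache_key("text-embedding-3-small", "hello world")
k2 = _cache_key("text-embedding-3-small", "hello world")
k3 = _cache_key("text-embedding-3-large", "hello world")
print(k1 == k2)  # True: identical inputs, identical key
print(k1 == k3)  # False: a different model gets its own cache entry
```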
Full caching tutorial
See Caching for RAG for a Postgres-based cache that works across multiple processes.
Code: generate embeddings + cosine similarity¶
Install¶
```bash
uv pip install openai
```
Embed texts¶
```python
from __future__ import annotations

import math

from openai import OpenAI

client = OpenAI()


def embed_texts(texts: list[str], *, model: str = "text-embedding-3-small") -> list[list[float]]:
    resp = client.embeddings.create(model=model, input=texts)
    # Preserve input order
    return [item.embedding for item in resp.data]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


texts = [
    "How do I reset my password?",
    "How can I change my account password?",
    "How do I deploy the service to production?",
]
embs = embed_texts(texts)
print("similarity(0,1) =", cosine_similarity(embs[0], embs[1]))
print("similarity(0,2) =", cosine_similarity(embs[0], embs[2]))
```
You should see that (0,1) is more similar than (0,2).
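Putting the pieces together, retrieval is just “embed the query, rank chunks by similarity, keep the top k.” A minimal sketch over precomputed vectors (the toy 3-d vectors below stand in for real embeddings):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_emb: list[float], chunk_embs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_emb, emb), i) for i, emb in enumerate(chunk_embs)]
    scored.sort(reverse=True)  # Highest similarity first.
    return [i for _, i in scored[:k]]


# Toy vectors standing in for embedded chunks; index 2 points a different way.
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
print(top_k(query, chunks, k=2))  # → [0, 1]
```

A vector store does exactly this, just with an index instead of a linear scan; the ranking logic is the same.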
Next steps¶
- Store and search embeddings with Postgres: Vector Stores for RAG (Postgres + pgvector)
- Improve retrieval quality: Retrieval Strategies (Beyond Top‑K)