Embedding
Definition
A numeric vector representation of a text, image or other data object in a high-dimensional space, where semantically similar objects sit close together. The basis for semantic search, classification, clustering and RAG.
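To make "sit close together" concrete, here is a minimal sketch of cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are hand-written toy values chosen for illustration, not output of any real embedding model, and the names `cosine_similarity`, `nearest` and `toy_embeddings` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: hand-written toy vectors. Real embeddings come from a model
# and usually have hundreds or thousands of dimensions.
toy_embeddings = {
    "invoice": [0.9, 0.1, 0.0],
    "bill": [0.8, 0.2, 0.1],
    "giraffe": [0.0, 0.1, 0.9],
}


def nearest(query, embeddings):
    """Return the other key whose vector is most similar to the query's."""
    q = embeddings[query]
    candidates = [key for key in embeddings if key != query]
    return max(candidates, key=lambda key: cosine_similarity(q, embeddings[key]))


print(nearest("invoice", toy_embeddings))
```

With these toy values, "bill" ranks closer to "invoice" than "giraffe" does. Semantic search applies the same comparison between a query vector and every stored document vector.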
Noise — Signal
Embeddings are sold as "AI now understands meaning instead of just words". What they actually capture is how terms are distributed in the embedding model's training corpus, not meaning in any epistemological sense. A model trained mostly on English web data will typically perform worse on German technical text, and a model trained in 2023 has never seen the terminology of a 2025 regulatory text. The choice of embedding model is therefore an architectural decision with consequences for search quality and compliance.
The right question
Not: "Which embedding model is state of the art?" But: "Which language, domain and time period does the model's training data cover, and how do we measure whether search quality on our actual content, not on benchmarks, meets the requirements?"
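One possible way to measure search quality on your own content is recall@k over a small set of hand-labeled queries: for each query, check whether the document a domain expert expects shows up in the top k results. The sketch below assumes a `search` callable that returns ranked document IDs. `fake_search`, the queries and the document IDs are hypothetical stand-ins for a real retrieval system and real labels.

```python
def recall_at_k(search, labeled_queries, k=5):
    """Fraction of queries whose expected document appears in the top k results."""
    hits = 0
    for query, expected_doc_id in labeled_queries:
        if expected_doc_id in search(query)[:k]:
            hits += 1
    return hits / len(labeled_queries)


# Hypothetical stand-in for a real retrieval call. It returns canned rankings
# so the example runs without an embedding model or an index.
def fake_search(query):
    canned = {
        "Kündigungsfrist Mietvertrag": ["doc-7", "doc-2", "doc-9"],
        "reporting deadline 2025 regulation": ["doc-4", "doc-1", "doc-3"],
    }
    return canned.get(query, [])


# Assumption: (query, expected document ID) pairs labeled by a domain expert
# on the organisation's actual content.
labeled_queries = [
    ("Kündigungsfrist Mietvertrag", "doc-2"),
    ("reporting deadline 2025 regulation", "doc-3"),
]

print(recall_at_k(fake_search, labeled_queries, k=2))
print(recall_at_k(fake_search, labeled_queries, k=3))
```

Running the same labeled set against each candidate model gives comparable numbers on the content that matters, which a public benchmark score cannot provide. The required threshold is a decision for the team and is deliberately left out of the sketch.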