Evaluation (Eval)
Definition
Systematic, reproducible measurement of an AI system's quality against defined criteria — typically using test datasets, metrics (accuracy, F1, BLEU, domain-specific scores), human-in-the-loop evaluation, or LLM-as-judge approaches.
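A minimal sketch of what such a measurement can look like in practice, assuming a small versioned eval dataset and a predict() stand-in for the model under test — both are illustrative, not a specific framework:

```python
def predict(text: str) -> str:
    """Stand-in for the model under evaluation (illustrative assumption)."""
    return "positive" if "good" in text else "negative"

# Versioned eval dataset with expected labels (contents are illustrative).
eval_dataset = [
    {"input": "good product, works as promised", "expected": "positive"},
    {"input": "support never answered", "expected": "negative"},
    {"input": "good value for the price", "expected": "positive"},
    {"input": "stopped working after a week", "expected": "negative"},
]

tp = fp = fn = correct = 0
for case in eval_dataset:
    pred = predict(case["input"])
    correct += pred == case["expected"]
    tp += pred == "positive" and case["expected"] == "positive"
    fp += pred == "positive" and case["expected"] == "negative"
    fn += pred == "negative" and case["expected"] == "positive"

accuracy = correct / len(eval_dataset)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")
```

The point is not the specific metric but that the dataset, the metric, and the expected labels are fixed and versioned, so the same measurement can be repeated after every model change.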
Noise — Signal
"We tested the model" and "we evaluated the model" are not the same thing. Tests check whether the system runs. Evaluation measures whether it does the right thing — continuously, with documented datasets, defined metrics and thresholds at which a model is rolled back. The majority of AI initiatives that fail in production have no real eval infrastructure, because it was prioritised as "later". It does not arrive later.
The right question
Not: "Does our model work?" But: "Which eval datasets, metrics and acceptance thresholds did we define before go-live, who checks them continuously, and what is the trigger for rollback or model change?"