Tech Nerve

Evaluations without the vibes.

Every team we meet ships their first LLM feature on gut feel. None of them ship their fifth that way. A practical guide to evaluating an LLM feature like it matters — because once it is in production, it does.

Tech Nerve AI/ML team · 9 min

There is a category of AI product that ships, gets lots of traffic, and quietly degrades every week without anyone noticing. The team shipped on vibes, has no eval harness, and only finds out about regressions when a user tweets a screenshot. This is the most common failure mode for GenAI features — more common than hallucination, more common than latency, more common than cost.

Evals are not tests

Unit tests ask: did this function return the correct value? Evals ask: is the distribution of outputs from this feature good enough for the users we are shipping it to? These are different questions, and they need different tooling and different mental models.

Write them anyway. An LLM feature without an eval harness is a feature your engineers cannot reason about and your business cannot commit to.

Start with a gold set

Fifty queries. Real ones, pulled from your support inbox, your user interviews, your product usage logs. For each, an ideal output — written by the person who knows the domain best, not the engineer trying to ship. Fifty is enough to feel a regression. Fifty is few enough that a human can read every output.

Note: If fifty feels like a lot, that is the point. An LLM feature that cannot be described by fifty examples is a feature whose scope has not been defined.
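In practice a gold set like this can live next to the code as a JSONL file, one case per line. A minimal loading sketch (the path and the `query`/`ideal` field names are illustrative, not anything this guide prescribes):

```python
import json
from pathlib import Path

# Hypothetical location and schema: one JSON object per line, holding a
# real user query and the domain expert's ideal answer.
GOLD_PATH = Path("evals/gold_set.jsonl")

def load_gold_set(path: Path) -> list[dict]:
    """Load the gold set and sanity-check that every case is complete."""
    lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
    cases = [json.loads(ln) for ln in lines]
    for case in cases:
        assert {"query", "ideal"} <= case.keys(), f"malformed case: {case}"
    return cases
```

Checking the schema at load time means a half-edited gold file fails loudly in CI instead of silently grading against empty references.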

Three graders, not one

No single grader catches everything. We typically run three in sequence:

  1. Deterministic checks — length, format, presence of required fields, JSON validity, citation count. Cheap, fast, catches most production regressions.
  2. LLM-as-judge — another model scores the output against the reference on criteria you care about (accuracy, tone, completeness). Rubric in the prompt, not vibes. Score a calibration set first to confirm the judge agrees with a human 85%+ of the time.
  3. Human review — a rotating sample of fifty live outputs per week, read by someone who knows the domain. Not optional. The failures that kill products are the ones your automated grader did not know to look for.
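The deterministic layer and the judge calibration step can each be a few lines. The sketch below is illustrative: the required fields, length cap, and score tolerance are assumptions, not a prescribed implementation:

```python
import json

def deterministic_checks(output: str, max_len: int = 2000) -> list[str]:
    """Run cheap format checks on one model output; return failure reasons."""
    failures = []
    if len(output) > max_len:
        failures.append(f"too long: {len(output)} > {max_len} chars")
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return failures + ["invalid JSON"]
    for field in ("answer", "citations"):  # hypothetical required fields
        if field not in parsed:
            failures.append(f"missing field: {field}")
    if not parsed.get("citations"):
        failures.append("no citations")
    return failures

def judge_agreement(judge_scores: list, human_scores: list, tolerance: int = 0) -> float:
    """Fraction of calibration cases where the LLM judge matches the human grade."""
    assert len(judge_scores) == len(human_scores)
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```

Run `judge_agreement` on your calibration set before trusting the judge; if it comes back under the 85% bar, fix the rubric, not the threshold.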

Wire evals to CI

Every prompt change, every model version, every context-window tweak runs against the gold set before merge. Regressions block the PR. This is not overkill; it is the minimum bar. Prompts are code. They should live in a repo, be reviewed like code, be tested like code.

If your prompt changes ship without evals, your prompt changes are changes to a production binary made by whoever had the terminal open last.
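A merge gate along these lines can be a single function in a CI job: replay the gold set, grade every case, compare against a pinned baseline. A sketch, where `run_feature` and `grade` are hypothetical stand-ins for your pipeline and your graders:

```python
# Pinned from the last known-good run; a PR that drops below it fails CI.
BASELINE_PASS_RATE = 0.90  # assumption, tune per feature

def gate(cases: list[dict], run_feature, grade) -> tuple[float, bool]:
    """Replay the gold set through the feature and grade each output.

    Returns (pass_rate, ok); ok=False should fail the CI job and block
    the prompt or model change from merging.
    """
    passed = sum(bool(grade(case, run_feature(case["query"]))) for case in cases)
    rate = passed / len(cases)
    return rate, rate >= BASELINE_PASS_RATE
```

The important design choice is the pinned baseline: a fixed number in the repo, updated deliberately, so a slow drift of "slightly worse" changes cannot ratchet quality down unnoticed.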

Evaluate the system, not the model

When your pipeline is retrieval → rerank → generate → critique, any one component can silently regress. Evaluate end-to-end output (that is what the user sees) but also each stage — retrieval recall at k, rerank top-1, critique agreement rate. Without stage-level evals you will spend weeks chasing an end-to-end regression that was a retrieval-recall drop caused by a vendor quietly changing an embedding model.
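Retrieval recall at k, for instance, is a few lines once you have labeled relevant documents per gold-set query; a minimal sketch:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to find, so nothing was missed
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)
```

Track this per stage on the same gold set as the end-to-end eval, and a vendor-side embedding change shows up as a recall drop at the retrieval stage instead of an unexplained end-to-end regression.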

Production evals are different

Offline evals live in CI. Online evals live in production, sampling live traffic, grading with the same LLM-as-judge rubric, reporting to a dashboard your oncall actually looks at. When a vendor updates their model, you will see it on this dashboard before you see it in a Slack complaint. That is the whole game.
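The online side can be as simple as a rolling window of judge scores over sampled traffic, with an alert when the mean dips. A sketch with an assumed window size, score floor, and minimum sample count:

```python
from collections import deque

class OnlineEvalMonitor:
    """Rolling window of LLM-judge scores over sampled live traffic.

    Hypothetical sketch: the online grader calls record() with each
    sampled score in [0, 1]; alert() trips when the rolling mean falls
    below the floor, which is what the dashboard and oncall page watch.
    """

    def __init__(self, window: int = 200, floor: float = 0.8, min_samples: int = 50):
        self.scores = deque(maxlen=window)
        self.floor = floor
        self.min_samples = min_samples

    def record(self, score: float) -> None:
        self.scores.append(score)

    def rolling_mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 1.0

    def alert(self) -> bool:
        # Wait for enough signal before paging anyone.
        return len(self.scores) >= self.min_samples and self.rolling_mean() < self.floor
```

A silent vendor model update shows up here as a score-mean drop within one window of traffic, which is exactly the dashboard-before-Slack property the section describes.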

Tagged
  • Evaluations
  • LLMOps
  • Quality
  • Testing