Build the eval harness before the product

On most teams building an LLM product, the instinct is to start prompting and see what comes out. When we started building Helix Labs’ clinician copilot, we did the opposite: we spent the first week writing the evaluation harness, before the product had a single working feature.

Why eval-first

LLM systems fail silently. A model can hallucinate a diagnosis code with the same confidence it states a correct one, and without a harness, you find out from a clinician instead of a test run. We built a few hundred labeled cases — real clinical transcripts, redacted and reviewed — before writing the retrieval pipeline that would answer them.

What the harness actually checked

The harness checked three things, in order of how much they mattered to the client: factual grounding against the source transcript, omission of required clinical fields, and tone drift over long sessions. Every prompt or retrieval change ran against the full suite before it touched staging.
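
To make those three checks concrete, here is a minimal sketch of how such a harness could be structured. It is an illustration, not our production code: the `Case` and `Output` schemas, the verbatim-substring grounding test, the per-turn `tone_scores` (assumed to come from some external scorer), and the `max_drift` threshold are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    # Hypothetical schema for one labeled case; not the real harness format.
    transcript: str
    required_fields: list[str]


@dataclass
class Output:
    # Hypothetical shape of what the system under test returns.
    fields: dict[str, str]
    evidence: list[str]       # transcript spans the output claims as support
    tone_scores: list[float]  # one score per turn, from an assumed external scorer


def ungrounded_spans(case: Case, out: Output) -> list[str]:
    """Evidence spans that do not appear verbatim in the transcript.

    A naive substring match stands in for a real grounding check.
    """
    source = case.transcript.lower()
    return [span for span in out.evidence if span.lower() not in source]


def missing_fields(case: Case, out: Output) -> list[str]:
    """Required fields that are absent or blank in the output."""
    return [f for f in case.required_fields if not out.fields.get(f, "").strip()]


def tone_drift(out: Output) -> float:
    """Absolute gap between mean tone of the first and second half of a session."""
    scores = out.tone_scores
    if len(scores) < 2:
        return 0.0
    mid = len(scores) // 2
    first, second = scores[:mid], scores[mid:]
    return abs(sum(second) / len(second) - sum(first) / len(first))


def evaluate(case: Case, out: Output, max_drift: float = 0.2) -> dict[str, bool]:
    """Run all three checks on one case; max_drift is an arbitrary example threshold."""
    return {
        "grounded": not ungrounded_spans(case, out),
        "complete": not missing_fields(case, out),
        "tone_stable": tone_drift(out) <= max_drift,
    }


def gate(cases: list[Case], run: Callable[[Case], Output], max_drift: float = 0.2) -> bool:
    """True only if every case passes every check; intended to block promotion to staging."""
    return all(all(evaluate(c, run(c), max_drift).values()) for c in cases)
```

In this sketch, `run` is whatever callable wraps the current prompt and retrieval configuration, so a change is promoted only when `gate(cases, run)` returns `True`. Keeping each check as a separate function that returns the offending items, rather than a single pass/fail flag, makes it easier to document failure modes per case.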

The payoff

By the time we handed the product to clinicians for a pilot, we already knew its failure modes, and so did they, because we’d documented them. That’s a very different conversation from “we think it’s pretty good.” It’s also the only way we’d be comfortable putting our name on a system that touches patient care.