How it works
How the measurement works
The lab is a simple loop run carefully. Everything else is depth for the people who want it.
1 · A task with a known answer
We pick a real piece of work and write down the correct answers ourselves. That is the ground truth.
2 · Hidden errors
We seed the task with deliberate mistakes — the kind a careless reading would copy straight through.
3 · Run it under a condition
We run agents on it under one condition at a time: working alone, checking each other, three tries and a vote, or the same pattern with a cheaper or stronger model.
4 · Score against the answer
We compare every field to the known answer, count how many hidden errors were caught, and tally the cost and time.
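The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lab's actual scorer: the names `Score` and `score_run`, the dictionary layout, and the invoice fields are all hypothetical. It also counts a seeded error as caught only when the output carries the correct value, which is narrower than "flagged or fixed", and it leaves out cost and time.

```python
from dataclasses import dataclass


@dataclass
class Score:
    accuracy: float     # fraction of fields that match the known answer
    errors_caught: int  # seeded errors the run did not copy through
    errors_seeded: int  # total seeded errors in the task


def score_run(truth: dict, seeded: dict, output: dict) -> Score:
    """Compare one run's output to the ground truth, field by field.

    truth  -- field -> correct value, written down by hand (step 1)
    seeded -- field -> deliberately wrong value planted in the task (step 2)
    output -- field -> value the agent produced under one condition (step 3)
    """
    if not truth:
        raise ValueError("ground truth must contain at least one field")
    correct = sum(1 for name, value in truth.items() if output.get(name) == value)
    # Assumption: "caught" means the output holds the true value for that field.
    caught = sum(1 for name in seeded if output.get(name) == truth[name])
    return Score(correct / len(truth), caught, len(seeded))


# Hypothetical example data, for illustration only.
truth = {"invoice_total": "1200.00", "due_date": "2024-07-01", "vendor": "Acme"}
seeded = {"due_date": "2024-01-07"}  # day and month swapped in the source
output = {"invoice_total": "1200.00", "due_date": "2024-01-07", "vendor": "Acme"}

result = score_run(truth, seeded, output)
print(result)
```

Here the run copied the planted date straight through, so it scores two of three fields correct and catches none of the seeded errors.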
The conditions we vary
The same task, run different ways. These are standard agent patterns; the names in parentheses are the technical terms researchers use.
Working alone
One agent, one pass, no second look. The baseline. (Chain-of-Thought)
Checking each other
One agent answers; a second re-checks the answer and flags what is wrong. (Round-Robin Review)
Three tries and a vote
Run it independently three times and take the majority answer. (Self-Consistency, CoT-SC)
Cheaper vs stronger model
Hold the pattern fixed and change the model, to separate “smarter model” from “better process.”
The console is the researcher’s control panel: pick any condition, run it, and compare runs side by side. The home page is the one-click version of the same engine.
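As an illustration of the "three tries and a vote" condition, here is a minimal sketch of a majority vote. It assumes the vote is taken per field and that a field with no strict majority is left unanswered (`None`); the source does not specify either choice. `vote` and `run_once` are hypothetical names, and `run_once` stands in for whatever calls the agent.

```python
from collections import Counter
from typing import Callable


def vote(run_once: Callable[[], dict], tries: int = 3) -> dict:
    """Run the task `tries` times independently and keep the majority value per field."""
    answers = [run_once() for _ in range(tries)]
    fields = sorted({name for answer in answers for name in answer})
    result = {}
    for name in fields:
        counts = Counter(answer.get(name) for answer in answers)
        value, n = counts.most_common(1)[0]
        # Assumption: require a strict majority; otherwise leave the field unanswered.
        result[name] = value if n > tries / 2 else None
    return result


# Hypothetical canned answers stand in for three independent agent runs.
canned = iter([
    {"due_date": "2024-07-01"},
    {"due_date": "2024-01-07"},
    {"due_date": "2024-07-01"},
])
voted = vote(lambda: next(canned))
print(voted)
```

Two of the three runs agree on the date, so the vote keeps that value and discards the outlier.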
What we measure
- Accuracy against the known answer, field by field.
- Errors caught — how many of the hidden mistakes the condition flagged or fixed. This is the heart of it: can the system catch its own mistakes?
- Cost and time to a finished answer. Reliability is only useful if you know what it costs. Value is reliability per dollar, not a single score.
- Reproducibility and audit. Every run replays from its saved configuration, and every decision is on the record. Trust is evidence you can re-check.
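One way to read "reliability per dollar" is to divide each reliability measure by the cost of the run and report the results side by side rather than folding them into one number. The sketch below assumes that reading; `RunRecord`, `per_dollar`, and the figures in the example are hypothetical and are not lab results.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    condition: str
    accuracy: float    # fraction of fields correct
    catch_rate: float  # fraction of seeded errors caught
    cost_usd: float    # total spend to reach a finished answer
    seconds: float     # wall-clock time to a finished answer


def per_dollar(run: RunRecord) -> dict:
    """Report each reliability measure per dollar, kept separate on purpose."""
    if run.cost_usd <= 0:
        raise ValueError("cost must be positive")
    return {
        "condition": run.condition,
        "accuracy_per_usd": run.accuracy / run.cost_usd,
        "catch_rate_per_usd": run.catch_rate / run.cost_usd,
        "seconds": run.seconds,
    }


# Made-up numbers, for illustration only.
example = RunRecord("checking each other", accuracy=0.75, catch_rate=0.5,
                    cost_usd=0.5, seconds=12.0)
summary = per_dollar(example)
print(summary)
```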
The open questions the lab settles with evidence
Does checking beat raw intelligence?
Can a cheaper model that checks its own work beat an expensive model that answers once? If so, oversight is a better lever than model size.
When does a manager help?
Does a coordinator directing workers beat a flat team that self-organizes — and at what size does structure start to pay off?
What is trust, measured?
Reproducibility, a complete audit trail, error-catch rate, and how fast a mistake can be caught and reversed — instead of one composite rating.
Contribute a task
Bring a task from your domain
The lab is strongest when the tasks come from real work. To add one, three things are needed:
- The task — a piece of work an agent should be able to do (a document to read, a reconciliation, a check to perform).
- The correct answers — what a careful expert would produce, so we can score against it.
- Where the hard parts are — the places a mistake is easy to make and costly to miss.
Bring one to the June 24 kickoff, or send it over and we’ll wire it in as a task card. Your results are co-authored with you.
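For anyone preparing a task in advance, the three ingredients listed above could be captured in a small structure like the one below. This is a sketch only: the lab's real task-card format is not described here, so `TaskCard`, its field names, and the idea of naming hard parts by answer field are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TaskCard:
    name: str
    task: str              # the work itself, e.g. the document to read
    correct_answers: dict  # field -> what a careful expert would produce
    hard_parts: list = field(default_factory=list)  # fields that are easy to get wrong

    def validate(self) -> list:
        """Return a list of problems; an empty list means the card is complete."""
        problems = []
        if not self.task.strip():
            problems.append("task is empty")
        if not self.correct_answers:
            problems.append("no correct answers to score against")
        if not self.hard_parts:
            problems.append("no hard parts identified")
        # Assumption: each hard part refers to a field in the correct answers.
        for name in self.hard_parts:
            if name not in self.correct_answers:
                problems.append(f"hard part '{name}' has no known answer")
        return problems


# Hypothetical example card.
card = TaskCard(
    name="invoice-check",
    task="Read the attached invoice and extract total, due date, and vendor.",
    correct_answers={"invoice_total": "1200.00", "due_date": "2024-07-01", "vendor": "Acme"},
    hard_parts=["due_date"],
)
print(card.validate())
```

An empty list from `validate` means all three ingredients are present and consistent with each other.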