Same task. Different systems. Measured.

Run one task through different agent setups and see which is most reliable — and what it costs. Click any system to change its model, instructions, tools, or how its agents coordinate.

Your task

What does a correct answer look like? (optional)

Each system runs your task several times. A judge model grades every answer — reliability is how often it passes.
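
As a rough illustration of the scoring described above, reliability can be read as a pass rate: the number of runs the judge passes, divided by the total number of runs. The sketch below is not this tool's implementation. The names `reliability`, `stub_system`, and `stub_judge` are hypothetical, and the stubs stand in for a real agent system and a real judge model.

```python
import itertools
from typing import Callable


def reliability(run_system: Callable[[str], str],
                judge: Callable[[str, str], bool],
                task: str,
                runs: int = 5) -> float:
    """Return the fraction of runs whose answer the judge passes."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    # Each iteration is one independent run of the system on the same task.
    passes = sum(judge(task, run_system(task)) for _ in range(runs))
    return passes / runs


# Hypothetical stand-ins: a "system" that is wrong on every third run,
# and a "judge" that passes only the known-correct answer.
_answers = itertools.cycle(["4", "4", "5", "4"])


def stub_system(task: str) -> str:
    return next(_answers)


def stub_judge(task: str, answer: str) -> bool:
    return answer == "4"


score = reliability(stub_system, stub_judge, "What is 2 + 2?", runs=4)
print(score)
```

With these stubs, three of the four runs pass, so the score is 0.75. A real judge model can itself make mistakes, so a score like this is an estimate that gets steadier as the number of runs grows.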

Systems under test

tap a system to include or exclude it

Single agent · single agent · Gemini 3 Flash

One model, one pass. No second agent checks the answer.

Sequential review · team · 2 agents

Your prompt goes to the first agent; its answer is passed to the next to check and improve.
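
The hand-off described above can be sketched as a simple chain: each agent sees the original prompt plus the previous agent's answer, and the last answer is the one that gets graded. This is a minimal sketch under assumptions, not the tool's actual coordination code. `Agent`, `sequential_review`, `drafter`, and `reviewer` are hypothetical names, and the two stub agents stand in for real model calls.

```python
from typing import Callable, Optional, Sequence

# Hypothetical agent signature: (original prompt, previous answer or None) -> answer.
Agent = Callable[[str, Optional[str]], str]


def sequential_review(prompt: str, agents: Sequence[Agent]) -> str:
    """Pass the prompt through the agents in order, each revising the last answer."""
    if not agents:
        raise ValueError("need at least one agent")
    answer: Optional[str] = None
    for agent in agents:
        # The first agent receives None; later agents receive the prior answer.
        answer = agent(prompt, answer)
    assert answer is not None
    return answer


def drafter(prompt: str, previous: Optional[str]) -> str:
    # Stub first agent: produces a draft that contains a mistake.
    return "The capital of Australia is Sydney."


def reviewer(prompt: str, previous: Optional[str]) -> str:
    # Stub second agent: checks the draft and corrects the known error.
    return (previous or "").replace("Sydney", "Canberra")


final = sequential_review("What is the capital of Australia?", [drafter, reviewer])
print(final)
```

Here the reviewer catches the drafter's mistake and the final answer names Canberra. A reviewer can also introduce an error into a correct draft, which is one reason to measure a team against a single agent rather than assume the extra step helps.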

Configure →

Each system runs the task several times and is scored on reliability (how often it's right) and cost.

Experiment tracks · the summer program

Three kinds of experiment a team can take on. Pick one, run it here, share what you find.

01 · runnable now

Known-answer

Problems with a ground truth, so any team can verify the result independently.

· One typo. Who catches it?
· The attack: an instruction hidden in the document

02 · open to every team

Bring your own problem

Put your own task or agent system under test. A judge model grades each answer against what you define as correct.

· Use “test your own task” above

03 · forming

Public-impact

A high-visibility experiment on a real public problem, like wildfire or coastal-safety risk, with regional partners.

· In design — bring a problem set