Our Methodology
Stop benchmarking how well an agent can think. Start benchmarking how cheaply, reliably, and collaboratively it can work in a world that is constantly breaking and changing.
The Status Quo
The problem with current benchmarks
Current LLM benchmarks (MMLU, GSM8K, HumanEval) and first-generation agent benchmarks (WebArena, SWE-bench, AgentBench) are largely static and outcome-based. The LLM benchmarks are single-turn; the agent benchmarks allow multi-step interaction but still score only the final result. Both ask one question: "Did the agent get the right answer?"
The next generation of benchmarks asks something harder and more useful: "How well did the agent work, and was it worth it?" That shift -- from outcome to process, from static to dynamic, from single-agent to multi-agent -- is what we are building toward.
Five Dimensions
What next-generation benchmarks measure
Each dimension addresses a blind spot in today's evaluation landscape.
Dynamic, Evolving Environments
The Problem
An agent that has memorized fixed API calls or web pages has not shown that it can adapt. Real workflows involve shifting documentation, updated APIs, and changing business rules.
Our Approach
Controlled Drift -- the benchmark environment intentionally changes mid-evaluation. A database schema gets a new column. An API endpoint is deprecated. The agent must detect the change, adapt, and re-plan without human intervention.
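As a rough illustration of Controlled Drift, the sketch below deprecates an endpoint partway through a run and records whether a simple agent notices and re-plans. The class `DriftingUserAPI`, the endpoint names, and the `hint` field are all hypothetical; a real harness would change schemas or documentation in far less convenient ways.

```python
class DriftingUserAPI:
    """Toy environment: 'v1/users' is deprecated after `drift_at` calls."""

    def __init__(self, drift_at: int):
        self.drift_at = drift_at
        self.calls = 0

    def get(self, endpoint: str) -> dict:
        self.calls += 1
        live = "v2/users" if self.calls > self.drift_at else "v1/users"
        if endpoint != live:
            # 410 Gone, with a pointer to the replacement (an assumption:
            # real deprecations are rarely this helpful).
            return {"status": 410, "hint": live}
        return {"status": 200, "data": ["alice", "bob"]}


def adaptive_agent(env: DriftingUserAPI, steps: int) -> list[int]:
    """Call the API `steps` times; return 1 per success, 0 per failure."""
    endpoint = "v1/users"
    outcomes = []
    for _ in range(steps):
        response = env.get(endpoint)
        if response["status"] == 410:
            endpoint = response["hint"]  # detect the change and re-plan
            outcomes.append(0)
        else:
            outcomes.append(1)
    return outcomes


outcomes = adaptive_agent(DriftingUserAPI(drift_at=3), steps=6)
print(outcomes)
```

The single failed step at the moment of drift is the kind of signal a drift-aware metric would score.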
Key Metric
Adaptability Quotient (AQ) -- how far performance drops after a change, and how long the agent takes to recover.
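One possible way to compute the two AQ components from a per-step score trace is sketched below. The function name `adaptability`, the `tolerance` threshold, and the choice to measure degradation at the post-drift trough are assumptions made for illustration, not a fixed definition of the metric.

```python
from statistics import mean
from typing import Optional


def adaptability(scores: list[float], drift_step: int,
                 tolerance: float = 0.95) -> tuple[float, Optional[int]]:
    """Return (degradation, recovery_steps) for one evaluation run.

    scores     -- per-step task scores in [0, 1]
    drift_step -- index of the first step after the environment changed
    tolerance  -- fraction of the pre-drift baseline that counts as recovered
    """
    if not 0 < drift_step < len(scores):
        raise ValueError("need at least one step before and after the drift")
    baseline = mean(scores[:drift_step])
    post = scores[drift_step:]
    trough = post.index(min(post))           # worst step after the change
    degradation = baseline - post[trough]
    for steps_since_drift in range(trough, len(post)):
        if post[steps_since_drift] >= tolerance * baseline:
            return degradation, steps_since_drift
    return degradation, None                 # never recovered within the run


degradation, recovery_steps = adaptability(
    [0.9, 0.9, 0.9, 0.3, 0.6, 0.9], drift_step=3
)
print(degradation, recovery_steps)
```

Reporting the two numbers separately keeps a shallow-but-slow recovery distinguishable from a deep-but-fast one.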
Collaborative & Competitive Workflows
The Problem
Agents that can't delegate, critique, or negotiate have limited utility in business processes.
Our Approach
Multi-Agent Sandboxes with role-based evaluation. A "PM" agent, "Developer" agent, and "QA" agent collaborate on a pull request. A "Procurement" agent negotiates with a simulated "Vendor" agent.
Key Metrics
Collaboration Efficiency (tokens and turns to reach consensus), Negotiation Win Rate, and Critique Quality.
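A minimal sketch of the "tokens and turns to consensus" idea follows. It assumes each turn carries an explicit approval flag and that consensus means every role's most recent turn approves; the `Turn` type and `cost_to_consensus` helper are hypothetical names, and a real sandbox would need a sturdier consensus signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str
    tokens: int
    approves: bool  # assumed flag: does this role sign off on the current proposal?


def cost_to_consensus(transcript: list[Turn],
                      roles: set[str]) -> Optional[tuple[int, int]]:
    """Return (turns, tokens) spent until every role approves, or None."""
    approved: set[str] = set()
    tokens = 0
    for turn_number, turn in enumerate(transcript, start=1):
        tokens += turn.tokens
        if turn.approves:
            approved.add(turn.role)
        else:
            approved.discard(turn.role)  # an objection revokes earlier approval
        if approved == roles:
            return turn_number, tokens
    return None  # the agents never converged


transcript = [
    Turn("PM", 120, True),
    Turn("Developer", 300, False),
    Turn("QA", 80, True),
    Turn("Developer", 150, True),
]
result = cost_to_consensus(transcript, {"PM", "Developer", "QA"})
print(result)
```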
Process & Cost, Not Just Outcome
The Problem
SWE-bench checks whether a patch passes tests. It does not account for whether the agent made 100,000 API calls costing $50 to get there. Brute-forcing a solution is comparatively easy; producing one that is elegant, fast, and cheap is hard.
Our Approach
Resource-Constrained Evaluation. Every agent gets a budget: monetary cost (e.g., $1.00), latency (under 30s), and compute steps (max 50 LLM calls). A perfect answer at 100x the cost loses.
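The budget described above could be enforced with a small guard that the harness charges on every LLM call. This is a sketch only: the `Budget` class and `BudgetExceeded` exception are hypothetical, the default limits mirror the example figures in the text, and latency is measured as wall-clock time since the budget was created.

```python
import time


class BudgetExceeded(Exception):
    """Raised when an agent goes over any of its resource limits."""


class Budget:
    def __init__(self, max_cost_usd: float = 1.00,
                 max_latency_s: float = 30.0, max_llm_calls: int = 50):
        self.max_cost_usd = max_cost_usd
        self.max_latency_s = max_latency_s
        self.max_llm_calls = max_llm_calls
        self.spent_usd = 0.0
        self.calls = 0
        self.started = time.monotonic()

    def charge(self, cost_usd: float) -> None:
        """Record one LLM call and fail fast if any limit is breached."""
        self.calls += 1
        self.spent_usd += cost_usd
        elapsed = time.monotonic() - self.started
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost ${self.spent_usd:.2f} over limit")
        if self.calls > self.max_llm_calls:
            raise BudgetExceeded(f"{self.calls} LLM calls over limit")
        if elapsed > self.max_latency_s:
            raise BudgetExceeded(f"{elapsed:.1f}s over latency limit")


budget = Budget()
stopped_because = None
try:
    for _ in range(60):
        budget.charge(0.03)  # pretend each call costs three cents
except BudgetExceeded as exc:
    stopped_because = str(exc)
print(budget.calls, stopped_because)
```

Failing fast, rather than scoring an over-budget run after the fact, keeps a brute-force agent from consuming evaluation resources it has already forfeited.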
Key Metric
Cost-Adjusted F1 = (F1 Score) / (Total Cost x Latency)
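To make the formula concrete, here is a small sketch. The units (US dollars and seconds) and the sample numbers are assumptions; scores are only comparable between runs that use the same units.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def cost_adjusted_f1(f1: float, total_cost_usd: float, latency_s: float) -> float:
    """Cost-Adjusted F1 = F1 / (Total Cost x Latency)."""
    if total_cost_usd <= 0 or latency_s <= 0:
        raise ValueError("cost and latency must be positive")
    return f1 / (total_cost_usd * latency_s)


# Hypothetical runs: a frugal agent versus a brute-force agent at 100x the cost.
frugal = cost_adjusted_f1(f1=0.90, total_cost_usd=0.50, latency_s=10.0)
brute_force = cost_adjusted_f1(f1=1.00, total_cost_usd=50.0, latency_s=10.0)
print(frugal, brute_force)
```

With these illustrative numbers the frugal agent scores far higher despite the lower raw F1, which is the behavior the metric is designed to produce.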
Multi-Dimensional Scorecards
The Problem
"Accuracy" on a QA task ignores whether the agent was rude, leaked PII, or took a nonsensical path. A single number obscures what matters.
Our Approach
Profile-Based Scoring. Each task defines a weighted scorecard. Financial Trading: 0.5 Profit + 0.3 Risk Management + 0.2 Compliance. Customer Support: 0.6 Resolution + 0.3 Sentiment - 0.2 Escalation.
Key Metric
Weighted Goal Score -- apples-to-apples comparison for agents optimized for different roles.
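The two scorecards above can be expressed as weight tables. The sketch below assumes every component is already normalized to the range 0 to 1 and that a negative weight acts as a penalty; the dictionary keys and the `weighted_goal_score` helper are illustrative names.

```python
PROFILES = {
    "financial_trading": {"profit": 0.5, "risk_management": 0.3, "compliance": 0.2},
    "customer_support": {"resolution": 0.6, "sentiment": 0.3, "escalation": -0.2},
}


def weighted_goal_score(profile: str, components: dict[str, float]) -> float:
    """Combine normalized component scores using the profile's weights."""
    weights = PROFILES[profile]
    missing = weights.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in weights.items())


# Hypothetical support agent: resolves most tickets but escalates half of them.
support_score = weighted_goal_score(
    "customer_support",
    {"resolution": 0.9, "sentiment": 0.8, "escalation": 0.5},
)
print(round(support_score, 2))
```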
Goal & Constraint Ambiguity
The Problem
Current benchmarks give clear instructions. Real workflows are ambiguous. Agents that can't ask clarifying questions or infer intent are brittle.
Our Approach
Under-Specified Tasks. The initial prompt is vague or contradictory. The agent must identify ambiguity, generate clarifying questions, and choose the most efficient resolution path.
Key Metrics
Question Efficiency (asking the 2 right questions rather than 20 wrong ones) and Assumption Reasonableness.
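One way Question Efficiency might be scored is sketched below. It assumes the task author has listed the hidden ambiguities and that each question has been labeled with the ambiguity it resolves, or `None` if it resolves nothing. Both the labeling scheme and the two-part result are assumptions, not a settled definition.

```python
from typing import Optional


def question_efficiency(asked: list[Optional[str]],
                        ambiguities: set[str]) -> tuple[float, float]:
    """Return (precision, coverage) for an agent's clarifying questions.

    precision -- distinct ambiguities resolved per question asked
    coverage  -- share of the task's ambiguities that were resolved
    """
    if not ambiguities:
        raise ValueError("task must define at least one ambiguity")
    if not asked:
        return 0.0, 0.0
    resolved = {label for label in asked if label in ambiguities}
    return len(resolved) / len(asked), len(resolved) / len(ambiguities)


task_ambiguities = {"deadline", "output_format"}
focused = question_efficiency(["deadline", "output_format"], task_ambiguities)
scattershot = question_efficiency([None] * 20, task_ambiguities)
print(focused, scattershot)
```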
In Development
Benchmarks under development
| Name | Core Focus | Key Differentiator |
|---|---|---|
| GAIA-2 | Multi-tool reasoning | Adds tool failure injection -- APIs randomly return errors |
| AgentNet | Dynamic web navigation | Website HTML changes every 24 hours -- no hardcoding |
| CodeWorkflow | Software engineering | Multi-agent collaboration: design, code, test, deploy |
| BargainBench | Negotiation | Two LLM agents bargain -- scored on persuasiveness and fairness |
| EcoAgent | Cost optimization | Analyze 10 GB of logs -- score = insight / cost |
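The tool failure injection mentioned for GAIA-2 in the table could look roughly like the wrapper below, which makes any tool raise an error at a configurable rate. The names `with_failure_injection`, `ToolError`, and `call_with_retries` are hypothetical, and the seeded random generator is an assumption made so that runs are reproducible.

```python
import random


class ToolError(RuntimeError):
    """Simulated transient failure of an external tool or API."""


def with_failure_injection(tool, failure_rate: float, seed: int = 0):
    """Wrap `tool` so that a fraction of calls raise ToolError."""
    rng = random.Random(seed)  # seeded so every agent faces the same failures

    def flaky(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ToolError("injected failure")
        return tool(*args, **kwargs)

    return flaky


def call_with_retries(tool, *args, max_attempts: int = 3):
    """Minimal agent-side recovery: retry, and report attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(*args), attempt
        except ToolError:
            if attempt == max_attempts:
                raise


def add(a: int, b: int) -> int:
    return a + b


healthy = with_failure_injection(add, failure_rate=0.0)
broken = with_failure_injection(add, failure_rate=1.0)
print(call_with_retries(healthy, 2, 3))
```

Counting the attempts an agent needs, and whether it gives up cleanly when a tool stays down, gives the benchmark something to score beyond the final answer.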
The Endgame
The Ultimate Metric
Every benchmark we build ladders up to one question: what is the return on AI investment for this workflow? Not accuracy in a vacuum. Not pass rates on contrived tasks. The ROI of the agent in production.
Formula
ROAI = (Value of Task Completed - Cost of Agent) / Time to Completion
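A direct transcription of the formula, with hypothetical numbers, is sketched here. The units are an assumption (US dollars and hours, so the result reads as net dollars of value per hour); the formula itself does not fix them.

```python
def roai(value_usd: float, agent_cost_usd: float, hours_to_complete: float) -> float:
    """ROAI = (Value of Task Completed - Cost of Agent) / Time to Completion."""
    if hours_to_complete <= 0:
        raise ValueError("time to completion must be positive")
    return (value_usd - agent_cost_usd) / hours_to_complete


# Illustrative only: a task worth $40 finished in 2 hours by an agent
# billed at $0.50/hour, so the agent costs $1.00 in total.
example = roai(value_usd=40.0, agent_cost_usd=1.00, hours_to_complete=2.0)
print(example)
```

A negative result means the agent cost more than the task was worth, regardless of how quickly it finished.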
A perfect benchmark lets an engineering manager answer: "Should I replace my $60k/year junior developer with this $0.50/hour agent workflow, given the 15% risk of catastrophic failure?"
That is the question worth answering. Everything else is academic.
Ready to benchmark what matters?
We're building the evaluation infrastructure the agent ecosystem is missing. If you're building agents or deploying them, we want to talk.