Framework

Our Methodology

Stop benchmarking how well an agent can think. Start benchmarking how cheaply, reliably, and collaboratively it can work in a world that is constantly breaking and changing.

The Status Quo

The problem with current benchmarks

Current LLM benchmarks -- MMLU, GSM8K, HumanEval -- and first-generation agent benchmarks -- WebArena, SWE-bench, AgentBench -- are largely static and outcome-based, and the LLM benchmarks are single-turn as well. They ask one question: "Did the agent get the right answer?"

The next generation of benchmarks asks something harder and more useful: "How well did the agent work, and was it worth it?" That shift -- from outcome to process, from static to dynamic, from single-agent to multi-agent -- is what we are building toward.

Five Dimensions

What next-generation benchmarks measure

Each dimension addresses a blind spot in today's evaluation landscape.

01

Dynamic, Evolving Environments

The Problem

An agent that succeeds only by memorizing fixed API calls or web pages has not shown that it can adapt. Real workflows involve shifting documentation, updated APIs, and changing business rules.

Our Approach

Controlled Drift -- the benchmark environment intentionally changes mid-evaluation. A database schema gets a new column. An API endpoint is deprecated. The agent must detect the change, adapt, and re-plan without human intervention.

Key Metric

Adaptability Quotient (AQ) -- performance degradation after a change, and recovery time.
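
One way to make AQ concrete is sketched below. It assumes the benchmark logs one score per episode and knows the index of the first episode run after the drift event. The names `adaptability`, `AdaptabilityResult`, and `tolerance` are hypothetical and chosen for illustration; this is not a fixed definition of the metric.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AdaptabilityResult:
    degradation: float             # baseline score minus first post-change score
    recovery_steps: Optional[int]  # post-change episodes until recovery, None if never


def adaptability(scores: Sequence[float], change_index: int,
                 tolerance: float = 0.05) -> AdaptabilityResult:
    """Compare per-episode scores before and after a drift event.

    scores[change_index] is the first episode run after the change.
    """
    before = scores[:change_index]
    after = scores[change_index:]
    if not before or not after:
        raise ValueError("need at least one episode on each side of the change")

    # Baseline is the mean score before the environment changed.
    baseline = sum(before) / len(before)
    degradation = baseline - after[0]

    # Recovery: index of the first post-change episode back within
    # `tolerance` of the baseline (0 means no visible degradation).
    recovery_steps = next(
        (i for i, s in enumerate(after) if s >= baseline - tolerance), None
    )
    return AdaptabilityResult(degradation, recovery_steps)
```

With scores `[0.9, 0.9, 0.4, 0.6, 0.88]` and `change_index=2`, this reports a degradation of roughly 0.5 and recovery two episodes after the change.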

02

Collaborative & Competitive Workflows

The Problem

Agents that can't delegate, critique, or negotiate have limited utility in business processes.

Our Approach

Multi-Agent Sandboxes with role-based evaluation. A "PM" agent, "Developer" agent, and "QA" agent collaborate on a pull request. A "Procurement" agent negotiates with a simulated "Vendor" agent.

Key Metrics

Collaboration Efficiency (tokens and turns to consensus), Negotiation Win Rate, and Critique Quality.
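
As a rough illustration of the first metric, the sketch below counts turns and tokens until every role has accepted the current proposal. The `Turn` record and its `agrees` flag are assumptions made for this example; a real sandbox would need its own way to detect agreement.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Turn:
    agent: str    # role name, e.g. "PM", "Developer", "QA"
    tokens: int   # tokens spent on this turn
    agrees: bool  # hypothetical flag: agent accepts the current proposal


def cost_to_consensus(turns: List[Turn],
                      roles: Set[str]) -> Optional[Tuple[int, int]]:
    """Return (turns, tokens) spent until every role's latest turn agrees.

    Returns None if the transcript ends without consensus.
    """
    latest: Dict[str, bool] = {}
    total_tokens = 0
    for n, turn in enumerate(turns, start=1):
        total_tokens += turn.tokens
        latest[turn.agent] = turn.agrees
        # Consensus requires that every role has spoken and currently agrees.
        if all(latest.get(role, False) for role in roles):
            return n, total_tokens
    return None
```

Lower values on both counts indicate a more efficient collaboration, provided the final pull request also passes its quality checks.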

03

Process & Cost, Not Just Outcome

The Problem

SWE-bench checks whether a patch passes tests. It does not account for whether the agent made 100,000 API calls costing $50 to get there. Brute-forcing a solution is comparatively easy. Producing one that is elegant, fast, and cheap is hard.

Our Approach

Resource-Constrained Evaluation. Every agent gets a budget -- monetary cost (e.g., $1.00), latency (e.g., under 30s), and compute steps (e.g., max 50 LLM calls). A perfect answer at 100x the cost loses.
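
A budget like this could be enforced with a small tracker that the harness charges on every LLM call. The sketch below is one possible shape, using the example limits above as defaults. `Budget`, `BudgetExceeded`, and `charge` are hypothetical names, not part of any existing harness.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run goes over any of its limits."""


@dataclass
class Budget:
    max_cost_usd: float = 1.00   # example limit from the text
    max_latency_s: float = 30.0  # example limit from the text
    max_llm_calls: int = 50      # example limit from the text
    spent_usd: float = 0.0
    llm_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record one LLM call and fail fast if any limit is breached."""
        self.spent_usd += cost_usd
        self.llm_calls += 1
        elapsed = time.monotonic() - self.started_at
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost {self.spent_usd:.2f} USD over limit")
        if self.llm_calls > self.max_llm_calls:
            raise BudgetExceeded(f"{self.llm_calls} LLM calls over limit")
        if elapsed > self.max_latency_s:
            raise BudgetExceeded(f"{elapsed:.1f}s over latency limit")
```

The harness would call `budget.charge(cost)` after each model call and score the run as failed if `BudgetExceeded` is raised.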

Key Metric

Cost-Adjusted F1 = (F1 Score) / (Total Cost × Latency)
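
The formula can be computed directly, as in the minimal sketch below. It assumes cost is measured in US dollars and latency in seconds. Because the result carries units of 1/(USD × s), scores are only comparable when every run uses the same units. The function names are illustrative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cost_adjusted_f1(f1: float, total_cost_usd: float, latency_s: float) -> float:
    """F1 divided by (total cost x latency), as in the formula above."""
    if total_cost_usd <= 0 or latency_s <= 0:
        raise ValueError("cost and latency must be positive")
    return f1 / (total_cost_usd * latency_s)
```

For example, an F1 of 0.8 achieved for $0.50 in 4 seconds gives `cost_adjusted_f1(0.8, 0.50, 4.0) == 0.4`.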

04

Multi-Dimensional Scorecards

The Problem

"Accuracy" on a QA task ignores whether the agent was rude, leaked PII, or took a nonsensical path. A single number obscures what matters.

Our Approach

Profile-Based Scoring. Each task defines a weighted scorecard. Financial Trading: 0.5 Profit + 0.3 Risk Management + 0.2 Compliance. Customer Support: 0.6 Resolution + 0.3 Sentiment - 0.2 Escalation, where escalation counts as a penalty.

Key Metric

Weighted Goal Score -- apples-to-apples comparison for agents optimized for different roles.
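
A scorecard of this kind reduces to a weighted sum. The sketch below uses the two example profiles from the text and assumes each component metric has already been normalized to the range 0 to 1. That normalization is an assumption of this example, as are the names `PROFILES` and `weighted_goal_score`.

```python
from typing import Dict

# Weights taken from the two example profiles in the text.
# A negative weight marks a penalty term.
PROFILES: Dict[str, Dict[str, float]] = {
    "financial_trading": {
        "profit": 0.5, "risk_management": 0.3, "compliance": 0.2,
    },
    "customer_support": {
        "resolution": 0.6, "sentiment": 0.3, "escalation": -0.2,
    },
}


def weighted_goal_score(profile: str, metrics: Dict[str, float]) -> float:
    """Weighted sum of normalized metrics for the given task profile."""
    weights = PROFILES[profile]
    missing = weights.keys() - metrics.keys()
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(w * metrics[name] for name, w in weights.items())
```

A support agent with full resolution, neutral sentiment (0.5), and an escalation scores about 0.55 under the `customer_support` profile.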

05

Goal & Constraint Ambiguity

The Problem

Current benchmarks give clear instructions. Real workflows are ambiguous. Agents that can't ask clarifying questions or infer intent are brittle.

Our Approach

Under-Specified Tasks. The initial prompt is vague or contradictory. The agent must identify ambiguity, generate clarifying questions, and choose the most efficient resolution path.

Key Metrics

Question Efficiency (asking the 2 right questions rather than 20 wrong ones) and Assumption Reasonableness.
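
Question Efficiency could be scored along the lines below. The sketch assumes each task ships with an annotated set of ambiguities, and that each question the agent asks has been labeled with the ambiguity it resolves, or `None` if it resolves nothing. Both the annotation scheme and the names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Set


def question_efficiency(resolved: Sequence[Optional[str]],
                        ambiguities: Set[str]) -> Dict[str, float]:
    """Score an agent's clarifying questions against annotated ambiguities.

    resolved[i] is the ambiguity that question i cleared up, or None.
    """
    # Repeated questions about the same ambiguity only count once.
    useful = {a for a in resolved if a in ambiguities}
    asked = len(resolved)
    return {
        # Share of questions that were worth asking.
        "precision": len(useful) / asked if asked else 0.0,
        # Share of the task's ambiguities that got resolved.
        "coverage": len(useful) / len(ambiguities) if ambiguities else 1.0,
    }
```

Two well-aimed questions on a task with two ambiguities score 1.0 on both measures, while twenty questions of which only one is useful score 0.05 precision and 0.5 coverage.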

In Development

Benchmarks under development

Name         | Core Focus             | Key Differentiator
GAIA-2       | Multi-tool reasoning   | Adds tool failure injection -- APIs randomly return errors
AgentNet     | Dynamic web navigation | Website HTML changes every 24 hours -- no hardcoding
CodeWorkflow | Software engineering   | Multi-agent collaboration: design, code, test, deploy
BargainBench | Negotiation            | Two LLM agents bargain -- scored on persuasiveness and fairness
EcoAgent     | Cost optimization      | Analyze 10GB logs -- score = insight / cost

The Endgame

The Ultimate Metric

Every benchmark we build ladders up to one question: what is the return on AI investment for this workflow? Not accuracy in a vacuum. Not pass rates on contrived tasks. The ROI of the agent in production.

Formula

ROAI = (Value of Task Completed - Cost of Agent) / Time to Completion
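
As a worked example of the formula, the sketch below assumes value and cost are in US dollars and time is in hours, so ROAI comes out in dollars of net value per hour. The units and the function name are assumptions for illustration.

```python
def roai(task_value_usd: float, agent_cost_usd: float,
         hours_to_completion: float) -> float:
    """(Value of Task Completed - Cost of Agent) / Time to Completion."""
    if hours_to_completion <= 0:
        raise ValueError("time to completion must be positive")
    return (task_value_usd - agent_cost_usd) / hours_to_completion
```

A task worth $100 that the agent completes for $10 in 3 hours gives `roai(100, 10, 3) == 30.0`. A negative result means the agent cost more than the value it delivered.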

A perfect benchmark lets an engineering manager answer: "Should I replace my $60k/year junior developer with this $0.50/hour agent workflow, given the 15% risk of catastrophic failure?"

That is the question worth answering. Everything else is academic.

Ready to benchmark what matters?

We're building the evaluation infrastructure the agent ecosystem is missing. If you're building agents or deploying them, we want to talk.
