Live Benchmarking

Browser Agent Test Suites

Head-to-head comparison of autonomous web agents — tested on real websites with real tasks. Success rates, costs, and failure modes, all independently verified.

Browser: LIVE · 7 agents · WebVoyager benchmark
Scraping: Coming soon
Search: Coming soon
Voice AI: Coming soon
Inference: Coming soon

Efficiency vs. Cost

Success rate plotted against cost per task

[Chart: success rate (y-axis, 0% to 100%) against cost per task (x-axis, $0.01 to $0.15). Data points:]
BrowserUse v3: 94.2% / $0.042
Stagehand: 91.4% / $0.051
BrowserUse v2: 89.8% / $0.058
MultiOn: 88.1% / $0.067
Skyvern: 85.3% / $0.081
HyperBrowser: 82.7% / $0.039
Operator: 79.6% / $0.120
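
One way to read a success-versus-cost chart is to ask which agents are not beaten on both axes at once. The sketch below is illustrative only and is not part of the published methodology: it takes the seven data points listed above and keeps the agents for which no other agent is at least as accurate and at least as cheap. The function name `pareto_frontier` is a hypothetical helper.

```python
# (agent, success rate in %, cost per task in USD), copied from the data points above
AGENTS = [
    ("BrowserUse v3", 94.2, 0.042),
    ("Stagehand", 91.4, 0.051),
    ("BrowserUse v2", 89.8, 0.058),
    ("MultiOn", 88.1, 0.067),
    ("Skyvern", 85.3, 0.081),
    ("HyperBrowser", 82.7, 0.039),
    ("Operator", 79.6, 0.120),
]


def pareto_frontier(agents):
    """Return names of agents that no other agent matches or beats on both axes."""
    frontier = []
    for name, success, cost in agents:
        dominated = any(
            other_success >= success
            and other_cost <= cost
            and (other_success > success or other_cost < cost)
            for _, other_success, other_cost in agents
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(AGENTS))  # ['BrowserUse v3', 'HyperBrowser']
```

On these numbers only BrowserUse v3 (highest success rate) and HyperBrowser (lowest cost) sit on the frontier; every other agent is both less accurate and more expensive than BrowserUse v3.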
ClawScore Overview

ClawScore rates how deployment-ready each agent is on OpenClaw, covering setup friction, tool compatibility, and uptime stability.

Top ClawScore: BrowserUse v3 (92)

Where Agents Struggle

Auth-gated flows 44%
Multi-step checkout 61%
Dynamic dropdowns 68%
Simple navigation 93%

Average success rate across all agents, by task type

Benchmark Suites

WebVoyager

He et al., 2024

ACTIVE

The most widely used benchmark for web agents. 643 tasks across 15 real websites (Google Flights, Amazon, GitHub, etc.) covering navigation, form filling, data extraction, and multi-step workflows.

Tasks: 643 · Sites: 15 · Runs: 5

Mind2Web

Deng et al., 2023

COMING SOON

2,000+ tasks across 137 websites with annotated action sequences. Tests generalization to unseen sites and complex, multi-step interactions — a harder test of real-world readiness.

Tasks: 2,000+ · Sites: 137 · Status: Q3 2025

WebVoyager Results

All agents are tested on identical task sets with deterministic seeds. So far only BrowserUse v3 has a published report; results for the other agents will be released as their reports are finalized.

Agent                                Success Rate  Cost / Task  Latency  Error Recovery  ClawScore
BrowserUse v3 (GPT-4o backbone)      94.2%         $0.042       1.2s     87%             92
Stagehand (Browserbase)              91.4%         $0.051       1.8s     81%             88
BrowserUse v2 (Claude 3.5 Sonnet)    89.8%         $0.058       2.4s     72%             84
MultiOn Agent (MultiOn, hosted)      88.1%         $0.067       3.1s     76%             81
Skyvern (Vision-first agent)         85.3%         $0.081       4.6s     69%             74
HyperBrowser (Cloud browser infra)   82.7%         $0.039       2.8s     64%             71
Operator (OpenAI CUA model)          79.6%         $0.120       5.2s     58%             63
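
Success rate and cost per task can be folded into a single figure: the expected spend per successful task. The sketch below assumes that the Cost / Task column is an average over all attempted tasks, failures included; the page does not state this, so treat the result as illustrative. The helper `cost_per_success` is hypothetical.

```python
# (agent, success rate as a fraction, cost per task in USD), from the results table above
RESULTS = [
    ("BrowserUse v3", 0.942, 0.042),
    ("Stagehand", 0.914, 0.051),
    ("BrowserUse v2", 0.898, 0.058),
    ("MultiOn Agent", 0.881, 0.067),
    ("Skyvern", 0.853, 0.081),
    ("HyperBrowser", 0.827, 0.039),
    ("Operator", 0.796, 0.120),
]


def cost_per_success(success_rate, cost_per_task):
    """Expected spend per successful task, assuming failed attempts are billed too."""
    return cost_per_task / success_rate


# Cheapest effective cost first
ranked = sorted(RESULTS, key=lambda row: cost_per_success(row[1], row[2]))
for name, rate, cost in ranked:
    print(f"{name:<15} ${cost_per_success(rate, cost):.4f} per successful task")
```

Under that assumption HyperBrowser, the cheapest agent per attempt, ranks second behind BrowserUse v3 once its lower success rate is priced in.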

How We Test

Reproducible Runs

Every agent runs the same tasks with pinned seeds and configs. Each agent gets 5 independent runs so that run-to-run variance can be measured. All configs are open source, so anyone can fork and verify them.
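
As a rough illustration of what measuring variance over 5 runs can look like, the sketch below summarizes per-run success counts into a mean rate and a sample standard deviation. The per-run counts are made-up placeholder values, not measured results, and `summarize_runs` is a hypothetical helper rather than part of the actual eval harness.

```python
from statistics import mean, stdev

TOTAL_TASKS = 643  # WebVoyager task count stated on this page

# Hypothetical successes per run for one agent (placeholder values, not real results)
successes_per_run = [600, 590, 610, 595, 605]


def summarize_runs(successes, total_tasks):
    """Convert per-run success counts to rates and summarize their spread."""
    rates = [s / total_tasks for s in successes]
    return {
        "mean": mean(rates),
        "stdev": stdev(rates),  # sample standard deviation across runs
        "min": min(rates),
        "max": max(rates),
    }


summary = summarize_runs(successes_per_run, TOTAL_TASKS)
print(
    f"mean {summary['mean']:.1%}, stdev {summary['stdev']:.2%} "
    f"over {len(successes_per_run)} runs"
)
```

Reporting the spread alongside the mean makes it easier to judge whether a gap between two agents is larger than their run-to-run noise.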

Automated Grading

Task success is evaluated by comparing final page state against ground-truth criteria — not self-reported by the agent. Human spot-checks validate edge cases.
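
A minimal sketch of state-based grading, assuming the final page state is captured as a dictionary of string fields and each ground-truth criterion is a required substring in one of those fields. The `Criterion` type, the `grade` function, and the field names are all hypothetical; the real harness may represent page state and criteria quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    field: str     # key in the captured page state, e.g. "url" (hypothetical schema)
    expected: str  # substring that must appear in that field


def grade(final_state, criteria):
    """Pass only if every criterion is met by the observed final page state."""
    return all(c.expected in final_state.get(c.field, "") for c in criteria)


# Hypothetical example: a checkout task graded on what the browser actually shows
final_state = {
    "url": "https://example.com/orders/confirmation",
    "text": "Thank you for your order",
}
criteria = [
    Criterion("url", "/orders/confirmation"),
    Criterion("text", "Thank you"),
]

print(grade(final_state, criteria))  # True
```

Because the check reads only the captured state, an agent that reports success without reaching the confirmation page would still fail.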

Neutral & Independent

We don't accept payment for rankings or report placement. Agents are tested on equal footing — same hardware, same network conditions, same evaluation criteria.

Want your agent benchmarked?

We'll run it through our eval harness and publish a full report — same methodology, same standards as every other agent on the platform.