Test Suites

Production-grade test suites for AI agents. Pick one, paste your endpoint, get results.

12 benchmarks

WebVoyager

Web navigation benchmark with 643 tasks across 15 live websites. Tests browser agents on search, form filling, multi-step workflows, and data extraction.

Browser | Test | 30 tasks | Community

WebArena

Realistic web environment benchmark testing agent capabilities on complex, multi-step web tasks requiring reasoning and planning.

Browser | Test | 25 tasks

Mind2Web

Large-scale web agent benchmark with over 2,000 tasks from 137 real-world websites covering diverse domains.

Browser | Test | 50+ tasks

SWE-bench Lite

Software engineering tasks from real GitHub issues. Agents must understand codebases, write patches, and pass test suites.

Coding | Test | 15 tasks | Community

HumanEval+

Extended code generation benchmark testing function-level Python synthesis against a much larger set of test cases than the original HumanEval.

Coding | Test | 20 tasks

Live Site Extraction

Real-time scraping benchmark against live websites. Tests data extraction accuracy, JS rendering, and anti-bot bypass.

Scraping | Test | 20 tasks | Community

GAIA

General AI Assistant benchmark testing multi-step reasoning, tool use, and real-world problem solving.

Reasoning | Test | 20 tasks

HotpotQA

Multi-hop question answering requiring reasoning across multiple documents and knowledge sources.

Reasoning | Test | 25 tasks

MT-Bench

Multi-turn conversation benchmark evaluating instruction following, reasoning, and response quality.

Reasoning | Test | 20 tasks

ToolBench

Large-scale tool-use benchmark with 500+ real APIs. Tests agents on API selection, parameter filling, and multi-tool orchestration.

Coding | Train | 500+ tasks

Real Call Scenarios

Voice AI benchmark testing real phone call handling: appointment booking, customer service, and information gathering.

Voice | Test | 15 tasks

AssistantBench

Open-ended web assistant tasks requiring planning, browsing, and information synthesis across multiple sites.

Browser | Reasoning | Test | 30 tasks