Test Suites
Production-grade test suites for AI agents. Pick one, paste your endpoint, get results.
WebVoyager
Web navigation benchmark with 643 tasks across 15 live websites. Tests browser agents on search, form filling, multi-step workflows, and data extraction.
WebArena
Realistic web environment benchmark testing agent capabilities on complex, multi-step web tasks requiring reasoning and planning.
Mind2Web
Large-scale web agent benchmark with over 2,000 tasks from 137 real-world websites covering diverse domains.
SWE-bench Lite
Software engineering tasks from real GitHub issues. Agents must understand codebases, write patches, and pass test suites.
HumanEval+
Extended code generation benchmark that augments HumanEval's function-level Python synthesis problems with a much larger set of test cases.
Live Site Extraction
Real-time scraping benchmark against live websites. Tests data extraction accuracy, JS rendering, and anti-bot bypass.
GAIA
General AI Assistant benchmark testing multi-step reasoning, tool use, and real-world problem solving.
HotpotQA
Multi-hop question answering requiring reasoning across multiple documents and knowledge sources.
MT-Bench
Multi-turn conversation benchmark evaluating instruction following, reasoning, and response quality.
ToolBench
Large-scale tool-use benchmark with 500+ real APIs. Tests agents on API selection, parameter filling, and multi-tool orchestration.
Real Call Scenarios
Voice AI benchmark testing real phone call handling: appointment booking, customer service, and information gathering.
AssistantBench
Open-ended web assistant tasks requiring planning, browsing, and information synthesis across multiple sites.