What is WildClawBench?

WildClawBench is an in-the-wild benchmark from InternLM that tests AI agents in a live OpenClaw environment. It consists of 60 human-authored, bilingual, multimodal tasks, each requiring about 8 minutes and 20+ tool calls.

Which agent harnesses does WildClawBench test?

The May 2026 release tests four harnesses: OpenClaw, Claude Code, Codex CLI, and Hermes Agent, running the same 60-task suite under each scaffold.

What is the highest score on WildClawBench?

Claude Opus 4.7 holds the top score at 62.2%. No other model exceeds 60%. Scores span a 43-point range from 19.3% to 62.2%.
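The spread and the implied failure rate follow directly from the two reported scores. A minimal arithmetic check, using only the figures quoted above (the variable names are illustrative):

```python
# Reported top and bottom scores, in percent (taken from the text above).
top_score = 62.2
bottom_score = 19.3

# Spread between the best and worst reported scores, in percentage points.
spread = round(top_score - bottom_score, 1)

# Share of the suite that the top model does not complete.
top_failure_rate = round(100 - top_score, 1)

print(f"spread: {spread} points")                    # 42.9, reported as "43-point"
print(f"top-model failure rate: {top_failure_rate}%")  # 37.8, i.e. "nearly 40%"
```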

What does WildClawBench reveal about agent capabilities?

The benchmark exposes a wide gap between agent demos and real-world capability. Even the top-scoring model fails nearly 40% of practical tasks involving multi-step workflows, email negotiation, code debugging, and media editing, and every other model fails more than 40%.

How does WildClawBench run its tests?

Each task runs inside a reproducible Docker container hosting an actual CLI agent harness with access to real tools rather than mock services, ensuring results reflect genuine capability.
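As a rough illustration of the one-container-per-task pattern described above, the sketch below builds and runs a `docker run` command for a single task. It is not WildClawBench's actual runner: the image name, the `TASK_ID` environment variable, the container naming scheme, and the timeout are all hypothetical, chosen only to show the shape of such a harness.

```python
import subprocess

# Hypothetical image name; the source does not name the real image.
DEFAULT_IMAGE = "example/agent-harness:latest"


def build_docker_command(task_id: str, image: str = DEFAULT_IMAGE) -> list[str]:
    """Build the argv for running one task in a fresh, disposable container."""
    return [
        "docker", "run",
        "--rm",                       # remove the container when it exits
        "--name", f"task-{task_id}",  # hypothetical naming scheme
        "-e", f"TASK_ID={task_id}",   # hypothetical variable read by the harness
        image,
    ]


def run_task(task_id: str, image: str = DEFAULT_IMAGE, timeout_s: int = 900) -> bool:
    """Run one task and report success as a zero exit code (an assumption)."""
    try:
        completed = subprocess.run(
            build_docker_command(task_id, image),
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # Only the docker client process is killed here; the container
        # itself may still be running and need separate cleanup.
        return False
    return completed.returncode == 0
```

Starting each task from a fresh container is what makes runs comparable: no state leaks from one task into the next.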

clawsmith.com/signal/wildclawbench-multi-harness-agent-benchmark-62pct-ceiling

WildClawBench: Even Claude Opus 4.7 Only Scores 62.2% on 60 Real-World Agent Tasks Across 4 Harnesses

InternLM's WildClawBench tests OpenClaw, Claude Code, Codex CLI, and Hermes Agent on 60 real-world tasks, each requiring about 8 minutes and 20+ tool calls. The top model, Claude Opus 4.7, reaches 62.2%, and no other model exceeds 60%. The 43-point score spread reveals a wide gap between agent demos and actual capability. 418 GitHub stars.

Product Idea from this Signal

A CI/CD service that runs real-world task benchmarks against your OpenClaw agent before every config or skill change ships to production

WildClawBench (458 GitHub stars) showed that even Claude Opus 4.7 scores only 62.2% on 60 real-world agent tasks. Research found a 37% gap between lab benchmarks and real-world deployment performance. Today, teams push agent config changes (model swaps, skill updates, prompt edits) to production blind, and no CI step catches regressions. Confident AI, LangSmith, and Braintrust offer evaluation tooling, but none runs as a git-triggered CI job against a live OpenClaw staging instance with real tool execution. The gap is a service that spins up a sandboxed OpenClaw instance, runs a task suite against it, and blocks the deploy if the pass rate drops below a threshold.
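The final step, blocking a deploy when the pass rate falls below a threshold, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an existing product or API: `TaskResult`, `pass_rate`, and `deploy_gate` are hypothetical names, and the threshold value is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str
    passed: bool


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that passed, between 0.0 and 1.0."""
    if not results:
        raise ValueError("no task results to evaluate")
    return sum(r.passed for r in results) / len(results)


def deploy_gate(results: list[TaskResult], threshold: float) -> int:
    """Return a CI exit code: 0 lets the deploy proceed, 1 blocks it."""
    return 0 if pass_rate(results) >= threshold else 1


# Example: three of four tasks pass, so the pass rate is 0.75.
results = [
    TaskResult("email-negotiation", True),
    TaskResult("code-debugging", True),
    TaskResult("media-editing", False),
    TaskResult("multi-step-workflow", True),
]
print(deploy_gate(results, threshold=0.70))  # 0: deploy proceeds
print(deploy_gate(results, threshold=0.80))  # 1: deploy is blocked
```

In a CI job, the returned value would be passed to `sys.exit` so that a non-zero code fails the pipeline step.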

CI-CD, BENCHMARKING, AGENT-EVALUATION, DEVTOOL, OPEN-SOURCE
