Independent AI QA agent that generates and runs E2E browser tests from a PR diff using a different model than the one that wrote the code
When a developer uses Claude Code or Cursor to write a feature, asking the same model to generate the tests reproduces the same blind spots: the AI tests what it built, not what was specified. AI-generated code has 1.7x more production issues than hand-written code, and AI-generated tests for that code create false confidence instead of catching those issues. The unserved gap is an independent model -- different provider, different weights, isolated context -- that reads only the PR diff and the specification, generates E2E browser tests, and runs them against a preview deployment without access to the context in which the code was written. Two YC companies (Canary, W26; Ghostship, S25) are in the space, but HN comments show skepticism: developers ask how this differs from GitHub Copilot, and the answer -- model independence -- is not clearly communicated. HN discussion of the vibe coding quality gap drew 137 points and 200+ comments (February 2026). BrowserStack has publicly acknowledged that QA teams spend 28 minutes per test failure while devs ship 33% faster with AI.
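The independence constraint described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Canary, Ghostship, or any other product works: `ModelRef`, `QaJob`, `build_qa_job`, and the provider names are hypothetical, and the actual model call and browser run are left out. The sketch shows two checks: the test-writing model must come from a different provider than the authoring model, and its prompt is built from the PR diff and the specification only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRef:
    """Hypothetical identifier for a model: who serves it and what it is called."""
    provider: str
    name: str


@dataclass(frozen=True)
class QaJob:
    """Hypothetical unit of work handed to the independent test-writing model."""
    reviewer: ModelRef
    prompt: str
    preview_url: str


def build_qa_job(author: ModelRef, reviewer: ModelRef,
                 pr_diff: str, spec: str, preview_url: str) -> QaJob:
    """Build a job whose context is limited to the diff and the spec.

    The signature deliberately has no parameter for the authoring
    session (chat history, plans, earlier tests), so that context
    cannot leak into the reviewer's prompt.
    """
    # Assumption: "independent" means at least a different provider.
    if author.provider.strip().lower() == reviewer.provider.strip().lower():
        raise ValueError("reviewer must come from a different provider than the author")

    prompt = (
        "Write end-to-end browser tests that check the specification below.\n"
        "Test what was specified, not what was built.\n\n"
        f"## Specification\n{spec}\n\n"
        f"## PR diff\n{pr_diff}\n"
    )
    return QaJob(reviewer=reviewer, prompt=prompt, preview_url=preview_url)
```

A caller would pass the resulting `QaJob` to whatever client talks to the reviewer model, then run the generated tests against `preview_url`. Rejecting a same-provider reviewer up front is one simple way to make the independence claim checkable instead of implied.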
Score Breakdown
Social Proof 4 sources
Gap Assessment
Canary (YC W26, 58 HN pts) and Ghostship (YC S25, 53 HN pts) exist, but neither clearly differentiates on model independence. The specific angle -- different model, isolated context, tests run against a preview deploy from the diff alone -- is not productized. BrowserStack's entry into the space (June 2025 launch) signals enterprise validation. The solo-dev / indie SaaS market is unserved by enterprise tools.