InvisibleBench scan profiles — scripts/run_scan.py and LLMVerifier

Key findings used in wiki

  • Four named scan profiles introduced: smoke (deterministic-only), dev (A/B/F buckets, K=1 adaptive), full (all checks, K=1 adaptive), publish (all checks, configured K).
  • ScanPlan data structure introduced so that cost is estimated explicitly before scoring starts; --dry-run writes scan_plan.json and cost_report.json without calling any model.
  • LLMVerifier updated with adaptive_repetitions: when the first verifier call returns a clear PASS or NOT_APPLICABLE, the remaining repetitions are skipped, which reduces unnecessary judge calls in development scans.
  • Scorer cache (use_cache=True) enabled on LLM verifier calls, so a repeated identical judge call returns the cached result and judge output stays deterministic.
  • --scenario-parallel N flag added to runner for concurrent scenario execution within a single model benchmark run.
  • Data flow updated: scenario JSON → RunPlan → harness → Transcript → ScanPlan → check execution → results → leaderboard.
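
The adaptive_repetitions behaviour noted above can be illustrated with a minimal sketch. This is not the LLMVerifier code: the function name run_with_adaptive_repetitions, the VerifierRun record, the scripted_judge helper, and the FAIL label are all hypothetical, and only the PASS and NOT_APPLICABLE labels come from the notes. The sketch assumes a repetition budget K and stops after the first call when that call returns one of the two clear verdicts. It does not attempt to model how the K=1 adaptive profiles behave.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Verdicts that end the repetitions early (labels taken from the notes above).
EARLY_STOP_VERDICTS = {"PASS", "NOT_APPLICABLE"}


@dataclass
class VerifierRun:
    """Hypothetical record of one check: verdicts collected and calls spent."""
    verdicts: List[str]
    calls_made: int


def run_with_adaptive_repetitions(
    judge: Callable[[], str], k: int, adaptive: bool = True
) -> VerifierRun:
    """Call `judge` up to `k` times; stop after the first call on a clear verdict.

    `judge` stands in for a single LLM verifier call (an assumption for
    illustration; a real verifier would take a transcript and a check).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    verdicts: List[str] = []
    for i in range(k):
        verdicts.append(judge())
        # Only the first call can trigger the early stop.
        if adaptive and i == 0 and verdicts[0] in EARLY_STOP_VERDICTS:
            break
    return VerifierRun(verdicts=verdicts, calls_made=len(verdicts))


def scripted_judge(verdicts: Iterable[str]) -> Callable[[], str]:
    """Test double: returns the given verdicts in order, one per call."""
    it = iter(verdicts)
    return lambda: next(it)


if __name__ == "__main__":
    clear = run_with_adaptive_repetitions(scripted_judge(["PASS", "FAIL", "FAIL"]), k=3)
    unclear = run_with_adaptive_repetitions(scripted_judge(["FAIL", "PASS", "PASS"]), k=3)
    print(clear.calls_made, unclear.calls_made)
```

With a budget of 3, a clear first verdict costs one judge call, while an unclear first verdict spends the full budget. Passing adaptive=False always spends the full budget.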