Canonical source

The authoritative methodology documentation lives at https://givecareapp.github.io/givecare-bench/methodology/. This page is a wiki-context summary. When details here diverge from the canonical docs site, the docs site is correct.

InvisibleBench Methodology

InvisibleBench is a conversation benchmark for evaluating whether caregiver-support AI stays inside the public safety and scope contract and delivers useful support once it clears that bar.

This page explains what the benchmark measures, why those dimensions were chosen, and how the runtime methodology has evolved.

What InvisibleBench is trying to measure

The benchmark is built around a simple idea:

A caregiver-support model must first be safe and in-bounds. Only then does it make sense to compare quality.

That is why InvisibleBench uses a gate-then-quality architecture rather than a single blended score.

Current public claim surface

InvisibleBench now makes a narrower and stronger public claim than a generic "overall AI quality" benchmark.

Strongest public claims

The benchmark is currently strongest as a calibrated public-red-line benchmark for:

  • safety
  • compliance
  • public hard-fail rates

Those are the parts of the system most appropriate to read as the benchmark's primary public accountability surface.

More cautious secondary claims

The benchmark also reports:

  • communication
  • coordination
  • boundary integrity
  • overall score

These remain useful for comparison, but they should be read more cautiously. The methodology has evolved toward stronger judge validation on the hard-fail layer first, while the quality layer remains secondary to the safety and compliance gates.

Scoring architecture

```mermaid
flowchart TD
    A["Response"] --> B{"Gate 1: Safety"}
    B -->|Fail| C["Score = 0.0"]
    B -->|Pass| D{"Gate 2: Compliance"}
    D -->|Fail| C
    D -->|Pass| E["Quality scoring"]
    E --> F["Communication"]
    E --> G["Coordination"]
    E --> H["Boundary integrity"]
    F --> I["Overall score"]
    G --> I
    H --> I
```
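The flow above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's implementation: the names `QualityScores` and `overall_score` are hypothetical, and the unweighted mean used to combine the three quality dimensions is an assumption, since this page does not state how the overall score is aggregated.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class QualityScores:
    """Hypothetical container for the three quality dimensions (0.0 to 1.0)."""
    communication: float
    coordination: float
    boundary_integrity: float


def overall_score(safety_pass: bool,
                  compliance_pass: bool,
                  quality: Optional[QualityScores]) -> float:
    """Gate-then-quality: a failed gate returns 0.0 before quality is read."""
    if not safety_pass:        # Gate 1: Safety
        return 0.0
    if not compliance_pass:    # Gate 2: Compliance
        return 0.0
    if quality is None:
        raise ValueError("quality scores are required once both gates pass")
    # ASSUMPTION: unweighted mean; the real aggregation is not documented here.
    return mean([quality.communication,
                 quality.coordination,
                 quality.boundary_integrity])
```

The point of the structure is that a high quality score can never offset a gate failure, which a single blended score would allow.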

Gate 1: Safety

The safety gate asks whether the model:

  • recognizes crisis signals,
  • avoids harmful guidance,
  • escalates appropriately,
  • and does not continue normal conversation when a safety response is required.

Gate 2: Compliance

The compliance gate asks whether the model stays within the scope of caregiver support rather than drifting into prohibited clinical or deceptive behavior.

Public hard fails include:

  • diagnosis,
  • patient-specific prescribing or treatment directives,
  • and false scope or capability claims.

Quality layer: Communication + Coordination + Boundary

If both gates pass, the benchmark then evaluates quality.

  • Communication asks whether the model treats the caregiver with dignity, attunement, respect for agency, and trauma-informed language.
  • Coordination asks whether the model reduces logistical burden through concrete, navigable support.
  • Boundary integrity asks whether the model avoids anthropomorphism, dependency cues, false memory, and false self-representation.

How the runtime methodology has evolved

InvisibleBench is no longer best described as a purely deterministic rubric with light model scoring on top. The current runtime is better understood as:

LLM-backed scoring, governed by verifier-style decomposition and calibrated first on the public hard-fail layer.

In practice, that means:

  1. Deterministic guardrails catch bright-line failures and preserve allowed behavior.
  2. LLM-backed safety and compliance scorers adjudicate semantic edge cases.
  3. Scan profiles separate cheap development feedback from strict publication scoring. Four named profiles apply different check-set filters and verifier repetition budgets to the same engine: smoke (deterministic only), dev (safety/compliance/boundary buckets, K=1 adaptive), full (all checks, K=1 adaptive), and publish (all checks, configured K). A --dry-run flag writes scan_plan.json and cost_report.json without calling any model, giving an explicit pre-flight cost estimate before expensive scoring starts [15].
  4. Scorer behavior is audited against human-labelled benchmark material so the public hard-fail layer is not just prompt-shaped guesswork.
  5. Communication, coordination, and boundary integrity remain secondary quality signals, while safety and compliance remain the strongest public accountability surface.
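The scan profiles in item 3 can be pictured as data that filters one shared check list and sets a repetition budget. The sketch below is an assumption-laden illustration, not the code in scripts/run_scan.py: `Check`, `ScanProfile`, `PROFILES`, and `plan_scan` are hypothetical names, the example `configured_k` values are invented, and it assumes the smoke profile spans all buckets while skipping LLM-backed checks. It computes an in-memory plan only, in the spirit of a pre-flight estimate, and calls no model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Check:
    """Hypothetical check record."""
    name: str
    bucket: str            # e.g. "safety", "compliance", "boundary"
    uses_llm: bool
    configured_k: int = 3  # ASSUMPTION: per-check K used by publication scans


@dataclass(frozen=True)
class ScanProfile:
    """Hypothetical profile: a check-set filter plus a repetition budget."""
    buckets: Optional[FrozenSet[str]]  # None means every bucket
    deterministic_only: bool
    fixed_k: Optional[int]             # None means use each check's configured K


PROFILES: Dict[str, ScanProfile] = {
    "smoke": ScanProfile(None, True, None),
    "dev": ScanProfile(frozenset({"safety", "compliance", "boundary"}), False, 1),
    "full": ScanProfile(None, False, 1),
    "publish": ScanProfile(None, False, None),
}


def plan_scan(profile_name: str, checks: List[Check]) -> dict:
    """Select checks for a profile and bound the judge calls per scored response."""
    profile = PROFILES[profile_name]
    selected: List[str] = []
    max_judge_calls = 0
    for check in checks:
        if profile.buckets is not None and check.bucket not in profile.buckets:
            continue
        if profile.deterministic_only and check.uses_llm:
            continue
        selected.append(check.name)
        if check.uses_llm:
            k = profile.fixed_k if profile.fixed_k is not None else check.configured_k
            max_judge_calls += k
    return {"profile": profile_name, "checks": selected,
            "max_judge_calls": max_judge_calls}
```

Keeping profiles as data rather than separate code paths is one way to ensure that development and publication scans exercise the same engine and differ only in coverage and cost.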

That evolution matters because caregiver-support evaluation lives in gray zones. A benchmark that only checks keywords will miss the real work. A benchmark that uses only an unconstrained LLM judge will drift. The current methodology tries to combine the strengths of both.

InvisibleBench's multi-turn emphasis is not only a safety preference. Recent conversation research shows that models degrade when requirements are revealed incrementally, often locking into early assumptions and struggling to recover; related work also frames part of the failure as an intent-alignment gap between what users mean and what models infer [6][7]. Drift-Bench reinforces this by demonstrating that cooperative breakdowns occur even in non-adversarial settings, so failure is not limited to adversarial scenarios [10]. For InvisibleBench, that supports scenarios that reveal need, risk, and context over time rather than only testing single-turn prompt compliance.

The methodology also draws on memory-evaluation research. ENGRAM introduces typed evaluation across episodic, semantic, and procedural memory categories and demonstrates that procedural memory (how to do things) is the weakest category, which is directly relevant to caregiving task guidance [11]. THEANINE shows that temporal ordering significantly affects retrieval accuracy and that models struggle to maintain correct chronological relationships across long conversations [12]. Together, these inform InvisibleBench's evaluation of context retention and temporal consistency in extended caregiving dialogues.

Grounding layers

InvisibleBench draws authority from five complementary layers.

| Layer | Function | What it contributes |
|---|---|---|
| Invisible risk | Anthropomorphism, emotional entanglement, confabulation | Keeps the benchmark focused on failure modes specific to companion-like AI systems [1] |
| Behavioral safety | Crisis routing, safe boundaries, not-therapy norms | Grounds the benchmark in public mental-health-adjacent safety expectations [2] |
| Patient voice | What people actually want from AI support | Pushes the benchmark toward continuity, contextual safety, and explicit boundaries [3] |
| Caregiver realism | Real caregiver conditions and support infrastructure | Keeps scenarios tied to actual caregiver life rather than abstract chatbot prompts [8][9] |
| Regulatory floor | Scope and disclosure expectations | Ensures evaluation reflects the public compliance surface, not just soft quality preferences |

Why these dimensions were chosen

Safety

The safety layer is heavily shaped by the finding that many models fail indirect or context-dependent crisis signals [4][5].

The LLM verifier runs a K-repetition majority vote. Publication scans use the per-check configured K (typically K=3). Development scans use K=1 with adaptive early exit: a clear PASS or NOT_APPLICABLE on the first pass stops the repetition budget without calling the judge again. The adaptive verifier also enables the in-process scorer cache for deterministic judge calls [15].

Because models tend to miss indirect and context-dependent crisis signals, the benchmark emphasizes:

  • ambiguous crisis cues,
  • gradual escalation,
  • multi-turn drift,
  • and masked or indirect risk expression.
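The K-repetition majority vote with adaptive early exit described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in llm_verifier.py: `verify` and `CLEAR_VERDICTS` are hypothetical names, the judge is modeled as a zero-argument callable that returns a verdict string, and ties fall to whichever verdict was seen first, which may differ from the real tie-breaking rule.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Verdicts that end the repetition budget when seen on the first pass.
CLEAR_VERDICTS = {"PASS", "NOT_APPLICABLE"}


def verify(judge: Callable[[], str], k: int, adaptive: bool = False) -> Tuple[str, int]:
    """Run up to k judge calls; return (majority verdict, calls actually made)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    verdicts: List[str] = []
    for _ in range(k):
        verdict = judge()
        verdicts.append(verdict)
        # Adaptive early exit: a clear first verdict stops further judge calls.
        if adaptive and len(verdicts) == 1 and verdict in CLEAR_VERDICTS:
            break
    # ASSUMPTION: ties resolve to the first-seen verdict; use odd k to avoid them.
    winner, _count = Counter(verdicts).most_common(1)[0]
    return winner, len(verdicts)
```

In this sketch an initial FAIL never triggers the early exit, so a possible failure always receives the full repetition budget while clear passes stay cheap.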

Compliance

The compliance layer exists because caregiver-support AI has to stay inside a bright public boundary: support, orientation, and practical guidance are allowed; diagnosis, patient-specific clinical direction, and false professional posture are not. The DSM-5-TR establishes the bright line between clinical diagnosis, which requires licensed professional judgment, and colloquial description of symptoms, providing the authoritative taxonomy InvisibleBench uses to test whether AI companions avoid crossing into diagnostic territory [13]. ICD-11 code QD85 classifies burnout as an occupational phenomenon explicitly distinct from mental disorders, informing boundary scenarios that test whether AI systems correctly frame caregiver exhaustion as occupational rather than clinical [14].

Communication

Communication exists because a model can technically avoid hard fails while still treating the caregiver badly — flattening them, paternalizing them, or missing the human meaning of what they said. The current Communication dimension carries the earlier regard work forward into dignity, recognition, agency, and trauma-informed language.

Coordination

Coordination exists because caregiver support is not only emotional. A strong system has to help people take the next useful step, name real supports, and reduce navigation burden.

Boundary integrity

Boundary integrity exists because caregiver-support AI can become unsafe even when it sounds warm and helpful. A strong system must avoid human identity claims, emotional dependency cues, false memory, and promises of availability or capability it does not actually have.

How to read benchmark results now

The most honest public reading is:

  • Safety and compliance are the benchmark's strongest current public accountability claims.
  • Communication, coordination, boundary integrity, and overall score are informative, but their calibration is still less mature than that of the hard-fail layer.
  • The benchmark is strongest when used to answer: who stays inside the public safety/scope contract, how often, and on which kinds of failure?

It is not yet equally strong as a final authority on every close quality ordering between two models that already perform similarly on the hard-fail layer.
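The question of who stays inside the contract, how often, and on which kinds of failure amounts to a per-model tally of hard fails. The sketch below assumes a hypothetical result shape, a `(model, fail_kind)` pair per scored scenario with `fail_kind` set to `None` when no hard fail occurred. The benchmark's actual result format is not described on this page, and `hard_fail_summary` is an illustrative name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def hard_fail_summary(
    results: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, dict]:
    """Tally hard-fail rate and failure kinds per model from (model, fail_kind) pairs."""
    totals: Counter = Counter()
    fails: Dict[str, Counter] = defaultdict(Counter)
    for model, fail_kind in results:
        totals[model] += 1
        if fail_kind is not None:
            fails[model][fail_kind] += 1
    return {
        model: {
            "hard_fail_rate": sum(fails[model].values()) / totals[model],
            "by_kind": dict(fails[model]),
        }
        for model in totals
    }
```

Reporting the rate together with the breakdown by kind keeps the reading on the hard-fail layer, where the methodology's calibration is strongest.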

What this methodology does not claim

InvisibleBench is a conversation benchmark, not a full product audit.

It does not claim to fully measure:

  • product-level privacy and security,
  • app design or notification strategy,
  • sensitive-disclosure minimization outside conversation behavior,
  • or every possible downstream real-world outcome.

It also does not publish every judge prompt, threshold, or verifier detail. The public methodology is meant to explain the benchmark's logic and public claim surface, not make the benchmark easy to game.


  1. NIST. "AI 600-1."

  2. NAMI & Dr. John Torous. "AI Evaluation: 5 Criteria." 2026.

  3. National Health Council. "Patient Voice Report." 2026.

  4. Rosebud AI. "CARE Framework."

  5. Cheng et al. "Slow Drift of Support." 2026.

  6. Laban et al. "Lost in Conversation: Long-Context Unreliability in LLMs." arXiv:2505.06120, 2025.

  7. Liu et al. "Intent Mismatch in Multi-Turn Conversations." arXiv:2602.07338, 2026.

  8. AARP/NAC. "Caregiving in the United States 2025."

  9. ACL. "2024 Report to Congress on the 2022 National Strategy to Support Family Caregivers."

  10. "Drift-Bench: Cooperative Breakdowns in Conversational AI." arXiv:2602.02455, 2026.

  11. "ENGRAM: Episodic/Semantic/Procedural Typed Memory Evaluation." arXiv:2511.12960, 2025.

  12. "THEANINE: Timeline-Based Memory Retention." arXiv:2406.10996, 2024.

  13. American Psychiatric Association. "DSM-5-TR." 2022.

  14. World Health Organization. "ICD-11, QD85: Burnout." 2022.

  15. givecare-bench scripts/run_scan.py and src/invisiblebench/evaluation/verifiers/llm_verifier.py @ commit 1dff29f (2026-05-20). Scan profiles added: smoke, dev, full, publish. Adaptive repetition and pre-flight ScanPlan cost estimation introduced.