# Multi-Turn Safety Failures in Caregiver AI
Caregiving conversations are not single exchanges. A caregiver texting Mira does so over days, weeks, and months. The conversation accumulates context: a mother's diagnosis, a medication change, a benefit application started but not finished, a moment of despair at 2 AM. Yet most major AI safety evaluations test only single turns. The research synthesized here shows why that is dangerous and what breaks when conversations extend.
## The scale of the problem
Six independent research efforts converge on the same conclusion: current language models fail in sustained conversations, and the failure modes are precisely the ones that matter most for caregiving.
### Failure rate: 88%
Cheng et al. tested mental health chatbots across extended interactions and found an 88% failure rate[^1]. Drift begins around turn 4-5, the point where a chatbot has enough conversational history to lose track of it. Failure modes include topic drift (the model wanders from the user's concern), contradictory advice (the model recommends something it cautioned against three turns earlier), and missed safety signals (the model fails to recognize escalating distress because it has lost the emotional arc).
For a caregiver, turn 4-5 is barely into the conversation. A caregiver describing their situation — who they care for, what the diagnosis is, what help they need — routinely takes 5-10 turns. The drift starts before the model has the full picture.
### Accuracy degrades 39%, unreliability doubles
Laban et al. measured what happens to factual accuracy across long conversations: a 39% degradation in accuracy and a 112% increase in unreliability[^2]. Critically, they found that large context windows do not prevent this. Models with 128K or larger context windows still lose coherence. The degradation is gradual and difficult for users to detect without structured evaluation.
In a caregiving context, this means a model that correctly identified a caregiver's state in week one may give inaccurate guidance by week three — and the caregiver has no way to know the quality has changed.
### Single-turn tests miss 80% of failures
PBSuite demonstrated the single most important finding for evaluation design: policy violation rates jump from less than 4% in single-turn testing to 84% in multi-turn testing under conversational pressure[^3]. The pressure need not be adversarial in the traditional sense. Each individual turn can appear benign. The compounding effect of sustained interaction is what breaks policy adherence.
This means that any AI safety benchmark testing only single turns — which describes most benchmarks — is capturing less than 5% of real-world failure. For an AI system deployed to vulnerable populations, this is the difference between "passes safety review" and "fails in the field."
### Stance shifting: the sycophancy problem
The Chameleon study quantified how models change their expressed positions under conversational pressure: stance shift scores of 0.391-0.511[^4]. On a 0-1 scale, this means models shift their stated position by roughly 40-50% when users push back. The effect is more pronounced on sensitive topics, including health and mental wellbeing.
For caregiving, this is the sycophancy failure mode. A caregiver says they are fine. The model agrees, even though prior context — reported sleep deprivation, missed medications, financial strain — suggests they are not. A caregiver says they do not need help. The model accepts this, even though the conversation history shows escalating distress. Sycophantic agreement in caregiving is not a politeness problem. It is a safety failure.
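To make the metric concrete, here is a toy sketch of how a stance-shift score on a 0-1 scale could be computed. The numeric stance mapping and the averaging choice are illustrative assumptions, not the Chameleon study's published formula.

```python
# Toy stance-shift score on a 0-1 scale. Stances are assumed to be
# pre-mapped to numbers in [0, 1] (e.g. 0 = "you need help now",
# 1 = "you're fine"); that mapping is a hypothetical, not Chameleon's.

def stance_shift_score(stances: list[float]) -> float:
    """Mean absolute change between the model's initial stance and
    every later stance it expresses in the conversation."""
    if len(stances) < 2:
        return 0.0
    initial = stances[0]
    shifts = [abs(s - initial) for s in stances[1:]]
    return sum(shifts) / len(shifts)

# A model that starts at mild concern (0.3) and drifts toward full
# agreement with "I'm fine" (0.8) after repeated pushback:
print(round(stance_shift_score([0.3, 0.4, 0.6, 0.8]), 2))  # 0.3
```

A score near 0.4-0.5, the range Chameleon reports, would mean the model's final positions bear little resemblance to its initial ones.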
### Root cause: intent alignment gap
Liu et al. identified the intent alignment gap as the mechanism underlying multi-turn failure[^5]. Models progressively lose track of the user's original intent as conversations extend. The misalignment accumulates silently. The model continues generating fluent, contextually plausible responses that diverge from what the user actually needs.
This is especially dangerous in caregiving because the user's original intent may be unstated or evolving. A caregiver who texts "I don't know what to do anymore" is not asking for a list of options. The intent gap means the model is most likely to miss the emotional signal precisely when the conversation has enough history to make the signal detectable.
### A boundary, not a cliff
Dongre et al. provided the one finding that points toward mitigation: context drift stabilizes as a bounded stochastic process rather than degrading without limit[^6]. Periodic reminder interventions (re-injecting context summaries or key facts) reduce drift magnitude and maintain conversational coherence.
This means drift is manageable with system-level design. It is not an inherent, unbounded failure of language models. It is a property of the conversation architecture.
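The reminder intervention can be sketched as a small transform over the message history. The interval, the summary text, and the `with_reminders` helper below are assumptions for illustration; only the idea of periodically re-injecting a context summary comes from Dongre et al.

```python
# Sketch of a periodic reminder intervention: after every K user turns,
# re-inject a caregiver profile summary so context does not drift.
# The interval and summary wording are illustrative assumptions;
# messages use the common {"role", "content"} chat shape.

REMINDER_INTERVAL = 5  # user turns between re-injections (illustrative)

def with_reminders(history: list[dict], profile_summary: str,
                   interval: int = REMINDER_INTERVAL) -> list[dict]:
    """Return a copy of the history with a system-role summary
    inserted after every `interval` user turns."""
    out, user_turns = [], 0
    for msg in history:
        out.append(msg)
        if msg["role"] == "user":
            user_turns += 1
            if user_turns % interval == 0:
                out.append({
                    "role": "system",
                    "content": f"Reminder of key context: {profile_summary}",
                })
    return out
```

Because the transform is applied at request time, the stored conversation stays untouched; only the prompt the model sees is reinforced.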
## Why this matters for caregivers specifically
Four properties of caregiving conversations make them maximally vulnerable to multi-turn failure:
1. Long duration. Caregiving relationships with Mira span weeks to months. Every finding above shows degradation over time. Caregiving conversations operate in exactly the zone where models fail most.
2. Escalating stakes. Early caregiving conversations cover logistics and information. Later conversations may involve crisis, grief, or despair. The model's performance degrades precisely as the stakes rise.
3. Undetectable failure. Caregivers are not AI safety researchers. They cannot distinguish a model that has lost context from one that is simply offering a different perspective. The 39% accuracy degradation is invisible to the user.
4. Sycophancy as harm. In most applications, a model agreeing with the user is merely unhelpful. In caregiving, sycophantic agreement can mask a crisis signal, delay intervention, or reinforce a caregiver's isolation by confirming their stated belief that they do not need help.
## How InvisibleBench tests for it
InvisibleBench evaluates multi-turn safety through continuity scenarios — conversation arcs that span 10-30 turns and test whether the model maintains:
- Context retention: Does the model remember what the caregiver said in turn 3 when responding in turn 15?
- Emotional arc tracking: Does the model recognize that a caregiver who was coping in week one is deteriorating in week three?
- Consistency: Does the model maintain its guidance when the caregiver pushes back, or does it shift stance?
- Crisis detection over time: Does the model detect indirect crisis signals that emerge from the cumulative conversation, not from any single turn?
The multi-turn design is not optional. PBSuite showed that single-turn evaluation misses 80% of failures[^3]. InvisibleBench's core design choice, evaluating full conversation arcs rather than isolated turns, is a direct response to this evidence.
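A context-retention probe of the kind described above can be sketched in a few lines. The `context_retention_check` harness is a simplified illustration, not InvisibleBench's actual implementation: the model is abstracted as any callable from message history to a reply string, and the scenario content is invented.

```python
# Minimal sketch of a context-retention probe: plant a fact early in
# a conversation arc, then verify a late response still reflects it.
# `model` is any callable mapping a message history to a reply string.

def context_retention_check(model, planted_fact: str,
                            early_turns: list[str],
                            late_probe: str) -> bool:
    """Run the arc turn by turn, then test whether the reply to a
    late probe still mentions the fact planted early on."""
    history = []
    for user_msg in early_turns:
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": model(history)})
    history.append({"role": "user", "content": late_probe})
    final_reply = model(history)
    return planted_fact.lower() in final_reply.lower()

# A stub model that echoes its whole history trivially passes:
echo_model = lambda h: " ".join(m["content"] for m in h)
ok = context_retention_check(
    echo_model, "metformin",
    ["Mom was just prescribed metformin.", "She also has trouble sleeping."],
    "What medication did I mention earlier?")
print(ok)  # True
```

Real continuity scenarios span 10-30 turns and score emotional-arc tracking and consistency as well, but the plant-then-probe structure is the same.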
## What the research suggests for mitigation
The six papers point toward four mitigation strategies:
| Strategy | Evidence | Implementation |
|---|---|---|
| Periodic context reinforcement | Dongre et al.: drift is bounded and reducible with reminders[^6] | Mira re-injects caregiver profile summaries at conversation checkpoints |
| Anti-sycophancy guardrails | Chameleon: stance shift is measurable and predictable[^4] | InvisibleBench scores consistency across turns; Mira's system prompt resists agreement pressure |
| Multi-turn evaluation | PBSuite: single-turn tests miss 80% of failures[^3] | InvisibleBench tests 10-30 turn arcs, not single exchanges |
| Intent alignment monitoring | Liu et al.: intent gap accumulates silently[^5] | Mira periodically confirms understanding of caregiver's current needs |
None of these strategies require new model architectures. They are system-level interventions (conversation design, evaluation design, and prompt design) that address the bounded stochastic nature of drift[^6] rather than attempting to eliminate it.
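As one example of such a system-level intervention, an intent-alignment checkpoint can be approximated without touching the model at all. The token-overlap heuristic, the threshold, and the `intent_drift_alert` helper below are loud assumptions, a sketch of the idea rather than a published method.

```python
# Sketch of an intent-alignment checkpoint: every N turns the system
# asks the model to restate the user's current need, then compares that
# restatement against the tracked intent. The Jaccard token-overlap
# heuristic and 0.5 threshold are illustrative assumptions.

def intent_drift_alert(tracked_intent: str, restated_intent: str,
                       threshold: float = 0.5) -> bool:
    """Flag drift when the model's restatement shares too little
    vocabulary with the tracked intent."""
    a = set(tracked_intent.lower().split())
    b = set(restated_intent.lower().split())
    overlap = len(a & b) / len(a | b) if (a | b) else 1.0
    return overlap < threshold
```

A production system would use something stronger than token overlap (an embedding similarity, or a judge model), but even this crude check turns a silently accumulating gap into an observable signal.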
[^1]: Cheng, M. et al. "Slow Drift of Support: How Mental Health Chatbots Fail Over Long Conversations." arXiv:2601.14269, 2026.
[^2]: Laban et al. "Lost in Conversation: Long-Context Unreliability in LLMs." arXiv:2505.06120, 2025.
[^3]: "PBSuite: Multi-Turn Policy Adherence Under Adversarial Pressure." arXiv:2511.05018, 2025.
[^4]: "Chameleon: Stance Shifting in LLM Responses." arXiv:2510.16712, 2025.
[^5]: Liu et al. "Intent Mismatch in Multi-Turn Conversations." arXiv:2602.07338, 2026.
[^6]: Dongre et al. "Context Drift Equilibria: Reminder Interventions for Longitudinal Stability." arXiv:2510.07777, 2025.