Amplifying Inequity:

How General Purpose LLMs Fail Multi-Turn Health Conversations

By: Jocelyn Early MBA MHSc PA-C

Executive summary

Millions of people are already using AI to navigate their health, asking about their symptoms, medications, diagnostics and prevention. Yet the way we evaluate these tools bears almost no resemblance to how people actually use them.

Leading models are typically assessed on whether they can produce a reasonable response to a single, well-formed question written by an expert. Real patients don't communicate like that. They are vague, they deflect. They don't know which details are important, so they give too much or too little, or bury the important items in an avalanche of the irrelevant.

A missing part of the research is how LLMs behave across real conversations: when inputs are incomplete, when the patient is anxious, when the exchange unfolds over time. Without that, we have very little insight into whether these tools are actually safe.

This matters because the stakes are not theoretical. OpenAI's own research found increased use of ChatGPT for health concerns outside of clinic hours and in health deserts. People are already turning to these tools to close the access gap. For populations with financial, geographic, or literacy barriers to healthcare access, a general purpose LLM may be used as a substitute for care. People are making healthcare decisions based on what these tools tell them. If these tools are not designed and regulated with that reality in mind, this effort to democratize care will backfire.

This paper examines what actually happens when a vulnerable patient has a multi-turn health conversation with three of the most widely used AI models: Claude, Gemini, and ChatGPT. Not a single, well-formed clinical question, a real conversation. Vague symptoms. A family member who lost a leg. Healthcare anxiety. Deflection. The kind of conversation clinicians navigate every day, and the kind that reveals whether a model is genuinely supporting a patient or simply performing the appearance of care.

What we found goes well beyond accuracy. Two of the three models were not trying to help the user think through their situation. They were trying to steer the user toward a predetermined outcome. When the user deflected, they escalated. When the user pushed back, they coerced. One model leveraged the user's fear of amputation seven times as a persuasion tactic. Another progressively escalated its characterization of the user's leg pain from occasional to emergently severe based entirely on assumptions, never on anything the user actually said. When later asked to reflect on those assumptions, it denied making them, then fabricated conversational content to support its conclusions.

These are not edge cases or isolated errors. These are systematic behavioral patterns that only emerge across a sustained, realistic conversation. Evaluation frameworks that look for accuracy to a single prompt would never catch this.

The most consequential divergence across models was not factual accuracy. It was fundamental orientation: informing the user versus managing them. A model that decides it knows what the patient needs, then engineers that outcome by overriding the patient's expressed priorities, discounting their words, and manufacturing urgency, is not providing support. It is removing patient autonomy under the guise of care.

The populations most likely to rely on these tools are also the least equipped to detect these failures. A user with low health literacy may not recognize when an assumption has been treated as fact, or when their behavior has been pathologized. They cannot easily detect when the "emergency" they were told to seek care for was built on a chain of unverified inferences about symptoms they never fully described. These are not benign errors. The communication patterns demonstrated in this paper risk driving unnecessary healthcare consumption, eroding trust in care, widening health equity gaps, and causing genuine harm in vulnerable populations.

We are not describing a future risk. This is happening now. If AI continues on its current trajectory toward becoming the default healthcare resource for marginalized populations, and these failures are not corrected, the promise of AI-democratized access to care will become something far more dangerous: an algorithmic amplification of existing health disparities, delivered at scale, with no clinical oversight and no accountability.

That demands a response from clinicians, health systems, developers, policymakers, and insurers alike. This paper documents what is failing, why it matters, and what each of these stakeholders can do about it. These findings are a warning. AI-democratized healthcare is already here, and the populations that will depend on it the most, those with inadequate access to care, cannot afford for us to get this wrong.

We need to understand that a model’s ability to perform in a complex environment such as healthcare is so much more than its ability to answer a question “correctly.”

Read the full study

See how Claude, Gemini, and ChatGPT perform in a realistic patient conversation, evaluating model responses in communication quality, clinical accuracy, and health literacy.

Download study

Recommendations

Addressing the failures documented in this report will require moving beyond the single prompt-output evaluations that currently dominate the field toward assessment frameworks that capture sustained, realistic conversational behavior.

This represents a significant methodological shift; conversational failure patterns of the kind documented here are unlikely to be reliably detected by automated testing alone and may require human-conducted audits to surface. The clinical consequences of these failures are already presenting in practice, and the recommendations below reflect interventions that can be meaningfully implemented now, across multiple stakeholder groups, without waiting for that broader evaluation infrastructure to mature.

For clinicians

Our patients are using AI, and while this is beneficial in some instances, it may be causing harm in others, sometimes in ways that might be harder to detect. Patients may present with a miscalibrated sense of urgency, or concern for a disease process that was the result of a chain of AI assumptions. It can be difficult for patients to identify when assumptions have been accepted as fact, and even more so when they are compounded. When a patient's perception of the clinical course doesn't quite add up, consider the possibility that it may have been shaped by AI prior to the encounter in which it manifests.

Routinely address AI use with your patients proactively with a brief, non-judgmental phrase like, "Many of my patients use the internet and AI to help them understand their health. What tools do you like to use?" If you suspect your patient may be presenting with AI-generated anxiety or manufactured urgency, it can be helpful to validate their feelings and also explore the chain of escalations that got them there. If it is possible to review the conversation, or have them walk you through it, that may be helpful. If that is not an option, focus more of your visit on teaching and explaining how and why you arrived at your conclusion to help reinforce trust in your relationship.

Fostering an open and non-judgmental dialogue with your patients about safe AI use is crucial to maintaining lines of communication and trust.

To support clinicians, organizations should develop brief clinical protocols for AI-influenced visits, giving clinicians a structured, time-efficient way to surface and address AI-generated misinformation within a standard appointment without derailing the encounter.

To support patients, organizations should develop and distribute AI literacy materials alongside other health literacy efforts. These materials should help guide patients on how to best use AI in a health context and when it is advisable to direct their questions toward established care pathways. Based on the findings here, those materials should address how to recognize when an AI has made an assumption and how that impacts the advice it gives, as well as how AI's use of urgency language may not reflect actual clinical risk. Patients should be encouraged to verify AI advice prior to acting on it, including advice to proceed to the ED or to stop or start medications.

This information should be communicated to clinicians as part of regular education and made available to patients in multiple formats and languages.

For healthcare organizations

For AI engineers/ developers

AI engineers and developers are responsible for their model's behavior when handling health topics, even if that wasn't the intended use case. Models should be designed and evaluated with conversational behavior in mind. An outcomes-oriented conversational approach should be avoided; coercion, escalation, and pathologization in response to user resistance are its documented downstream effects and should be treated as evaluation failure criteria.

In addition to designing for the ability to recognize assumptions, models should be transparent about their use of assumptions in outputs. Assumptions should be clearly noted as such both in outputs and in internal reasoning architecture, and validation should be obtained before they are used as the basis for further reasoning.

Context window integrity and access transparency are essential in a health conversation. Models should clearly and proactively communicate to the user when the context window has been exceeded and their ability to accurately recount previous conversational turns is compromised. A model that cannot accurately access earlier conversational turns should not reconstruct or summarize them.

The findings documented here suggest that when a model attempts to reconstruct the content of earlier conversational turns, that effort is not neutral. The model confabulates content that supports its conclusions in later turns. In healthcare, this constitutes a patient safety failure that goes beyond a simple hallucination

For policy makers

Current regulatory frameworks largely address models that are designed and marketed for healthcare use. They do not adequately address the reality that general-purpose models are fielding health conversations at scale without clinical oversight, licensing, or accountability. Regulatory clarity is needed specifically for this category: models not intended for health use but demonstrably used for it. The absence of intent does not constitute the absence of responsibility.

These regulations should be developed from a health equity lens. Standards should be designed with vulnerable users in mind, and validation methodology should reflect how patients actually communicate, including multi-turn health conversation simulation of the kind employed in this study. Validation should assess performance across varying health literacy levels and should be conducted by independent auditors rather than developers themselves. A model that performs adequately for a health-literate user but fails a user with low health literacy is not a safe healthcare resource.

Evaluation standards alone are insufficient without corresponding accountability mechanisms. Policymakers should establish liability frameworks to hold LLM companies accountable for their models' outputs, specifically addressing the behaviors documented here: the provision of what constitutes medical advice by unlicensed models, the manufacture of urgency through unverified assumptions, and the fabrication of conversational content. Each of these represents a distinct category of potential harm with a distinct accountability question.

The behaviors documented in this paper represent liability exposure that current product insurance frameworks are not designed to capture, for two reasons. First, the harms are conversational and cumulative rather than transactional. A single model output may be defensible in isolation, but the pattern of assumption compounding, urgency escalation, and user coercion documented across a sustained interaction constitutes a different category of risk. Second, the fabrication of conversational content documented in Gemini's case raises questions about model integrity that go beyond accuracy failures and into the territory of material misrepresentation.

Insurers should therefore consider requiring the following as conditions of coverage for products that engage with health topics, whether by design or by demonstrated use pattern: validated multi-turn evaluation frameworks that reflect realistic patient communication; documented assumption-handling protocols with evidence of implementation; context window disclosure mechanisms; and periodic drift auditing to ensure that model behavior has not materially changed since initial evaluation. Products that cannot demonstrate these controls present an unquantifiable and therefore uninsurable risk profile in health contexts.

For professional and product liability insurers

Read the full study

See how Claude, Gemini, and ChatGPT perform in a realistic patient conversation, evaluating model responses in communication quality, clinical accuracy, and health literacy.

Download study

Amplifying Inequity:

How General Purpose LLMs Fail Multi-Turn Health Conversations

By: Jocelyn Early MBA MHSc PA-C

Executive summary

We need to understand that a model’s ability to perform in a complex environment such as healthcare is so much more than its ability to answer a question “correctly.”

Read the full study

Recommendations

For clinicians

For healthcare organizations

For AI engineers/ developers

For policy makers

For professional and product liability insurers

Read the full study

We help healthcare organizations uncover hidden risks in generative AI through Responsible AI testing.

Email: hello@sodelightfulai.com

Porto, Portugal

© 2025, Delightful AI. All rights reserved.