The AI knew the answer. The user couldn't find it.

Aram Zegerius
Technical conscience

Key Takeaways
- A Nature Medicine study tested whether chatbots help laypeople make better medical decisions. The models identified a relevant condition in 94.9% of cases – but participants using those same models correctly identified one in fewer than 35% of cases.
- What failed was not the model’s medical knowledge, but the way that knowledge became available through interaction.
- A model can score above 80% on medical exam questions and still correspond to user performance below 20% in real-world scenarios. Benchmarks measure knowledge, not safe interaction.
- Ask Aletta is built around the vulnerabilities this study exposes: deployment in a professional context, active query clarification, explicit source transparency, and built-in safety mechanisms that enable verification.
This week, Nature Medicine published a study testing whether general-purpose chatbots help people make better medical decisions. The headline result appears damning: participants who used AI chatbots performed no better – and sometimes worse – than those who relied on Google.
But the more revealing number is buried in the data. When the same models were tested directly, they correctly identified the relevant medical condition in 94.9% of cases. The problem was not what the models knew. It was everything that happened between the model and the user.
What the study actually tested
Researchers at the University of Oxford recruited 1,298 UK participants and presented them with ten medical scenarios, ranging from a common cold to a subarachnoid haemorrhage. Each participant was randomly assigned to use one of three LLMs (GPT-4o, Llama 3, or Command R+) or to use whatever resources they would normally consult at home – which for most meant Google and the NHS website. Data collection ran from August to October 2024, using mid-2024 versions of each model.
When the scenarios were given directly to the models, performance was strong. They identified at least one relevant condition in roughly 95% of cases and recommended the correct course of action in 56.3%.
But when participants used those same models themselves, they identified a relevant condition in at most 34.5% of cases and selected the correct course of action in around 43% of cases – no better than the control group. The control group in fact outperformed the LLM groups in identifying conditions, with 1.76 times higher odds of getting it right.
The interaction gap
The researchers analysed 30 interaction transcripts in detail and identified three recurring patterns.
First, users provided incomplete information. In 16 of the 30 interactions, initial prompts contained only partial details about the scenario. In one transcript, a participant describing gallstone symptoms mentioned “severe stomach pains” and vomiting after takeaway food but omitted the location, pattern, and frequency of the pain – all clinically relevant clues. The model did not ask follow-up questions. Unlike a physician conducting a structured history, the chatbot waited for the user to volunteer what mattered.
Second, users struggled to evaluate the suggestions they received. The models offered an average of 2.21 possible conditions per conversation, of which only 34% were correct. Participants had no structured way to assess which suggestions were reliable. Even when the correct condition appeared in the conversation – which occurred in roughly 70% of cases – participants included it in their final answer less than 35% of the time.
Third, the models were inconsistent. In one case, two participants described nearly identical symptoms of a subarachnoid haemorrhage to GPT-4o – severe headache, stiff neck, light sensitivity. One was advised to rest in a dark room. The other was correctly told to seek emergency care. The difference hinged on a single phrase: “came on suddenly.” For a condition where delay can be fatal, that level of inconsistency is not trivial.
It is important to emphasise: this is not a failure of user intelligence. The study shows what happens when laypeople, general-purpose models, and unstructured interaction are combined. The breakdown is architectural, not cognitive.
What the headlines get wrong
The easiest conclusion is that “chatbots make terrible doctors.” But that framing collapses three distinct variables into one. The study did not test AI medical knowledge in isolation. It tested the combination of lay users, general-purpose chatbots, and open-ended conversation.
The AI’s medical knowledge was strong. What failed was the interaction layer.
The models tested are already two generations old. The authors acknowledge that newer systems will likely score higher on benchmarks, but note it remains “unclear whether these gains will translate into higher performance with real users.” The interaction gap – between what a model knows and what a user can reliably extract – is not primarily a capability problem. It is a design problem.
The study highlights something else that deserves attention: standard benchmarks do not predict real-world performance. A model can score above 80% on medical exam questions and still correspond to user performance below 20% in practical scenarios. Benchmarks measure knowledge. They do not measure safe interaction.
What this means for AI in healthcare
This study adds to growing evidence that general-purpose AI systems are structurally ill-suited for medical decision-making in lay contexts. We previously wrote about how Google’s AI Overviews cite YouTube more often than medical sources – optimising for popularity rather than reliability. And last week we discussed how AI literacy frameworks for doctors remain theoretical without tools that make verification operational.
Bean et al. quantify what happens when verification infrastructure is absent: participants identify the correct condition in fewer than 35% of cases, even when the model suggests it in over 65% of interactions. The knowledge exists. The means to use it safely do not.
Where Ask Aletta fits
This study underlines why deploying general-purpose LLMs directly in lay contexts carries structural risk, even with additional safeguards. Where general chatbots primarily generate information, medical AI must structure interaction, enforce verification, and make context explicit. The interaction gap the study exposes is not a temporary shortcoming that disappears with better models. It is a design problem – and therefore a system problem.
Ask Aletta is not a generic chatbot with medical knowledge. It is a verification instrument for healthcare professionals that structures interaction, makes sources transparent, and supports clinical reasoning. It is designed for a context in which clinical expertise is present and where answers must be checkable, traceable, and weighable.
Each vulnerability identified in the study corresponds to explicit design decisions. Professionals, too, sometimes formulate short or incomplete queries. That is why Ask Aletta provides real-time query feedback: direct suggestions to refine a question before an answer is generated. Where the chatbots in the study waited for users to supply the right details, Ask Aletta actively supports the formulation of a clinically usable question.
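To make the idea of gating answers on query completeness concrete, here is a deliberately minimal sketch – a hypothetical illustration of the general pattern, not Ask Aletta's actual implementation. It checks whether a free-text symptom description mentions the attributes a structured history would elicit (onset, location, duration), and asks a follow-up question before any answer is generated. The attribute lists and cue words are invented for illustration; a real system would use far richer clinical logic.

```python
# Toy sketch of a clarification-gating layer (illustrative only, not a
# product API). Before a model is allowed to answer, check whether the
# user's description covers clinically relevant attributes; if not,
# return a follow-up question instead of generating an answer.

REQUIRED_ATTRIBUTES = {
    "onset": ["sudden", "gradual", "came on", "started"],
    "location": ["left", "right", "upper", "lower", "abdomen", "head", "chest"],
    "duration": ["minute", "hour", "day", "week"],
}

FOLLOW_UP_QUESTIONS = {
    "onset": "Did the symptoms come on suddenly or gradually?",
    "location": "Where exactly is the pain located?",
    "duration": "How long have the symptoms lasted?",
}

def missing_attributes(description: str) -> list[str]:
    """Return the attributes the description does not mention."""
    text = description.lower()
    return [
        attr
        for attr, cues in REQUIRED_ATTRIBUTES.items()
        if not any(cue in text for cue in cues)
    ]

def next_step(description: str) -> str:
    """Either a follow-up question or a go-ahead to generate an answer."""
    missing = missing_attributes(description)
    if missing:
        return FOLLOW_UP_QUESTIONS[missing[0]]
    return "OK_TO_ANSWER"
```

Fed the gallstone transcript from the study ("severe stomach pains and vomiting after takeaway food"), this gate would ask about onset rather than answer immediately – exactly the detail ("came on suddenly") that separated the two subarachnoid-haemorrhage outcomes. The point of the sketch is architectural: the follow-up question is enforced by the system, not left to the model's conversational initiative.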
Every answer includes explicit citations and source links, so claims can be checked against the underlying guideline or study. The chatbots in the study, by contrast, presented lists of possible conditions with no transparency about their origin or evidentiary basis.
Because Ask Aletta draws from verified clinical sources rather than general training data, the system is designed to minimise inconsistency – for example when different sources contain divergent recommendations. Features such as automatic personal data detection are part of a broader safety architecture centred on accountability and responsible use.
The study concludes that “safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge.” That points toward systems in which clinical expertise is present, the tool provides verified sources, and the interface enables verification.
The researchers tested lay users with mid-2024 chatbots. Models have become more capable since then. But increased capability was never the core issue. Closing the knowledge gap is not enough. The critical question is what we build around the model – who uses it, what it draws from, and whether the system enforces verification. In that architecture, the healthcare professional is not incidental. They are an essential component of safety.
Related posts
Why popularity is not a measure of medical reliability
Google's AI cites YouTube more than medical sources. What does this mean for healthcare professionals?

Doctors need AI literacy. But what does that look like in practice?
The Dutch Federation of Medical Specialists published an AI competency set. Good news, but it raises a question: where are the tools that make this possible?
Automatic detection of personal data
A new feature that helps you prevent accidentally sharing personal data in your search queries.