AI's Invisible Error
Would We Accept This Anywhere Else in Medicine?
It’s a frustrating paradox… I’m too busy, and my workflow is burdensome, so I use AI as a clinical resource. But… it isn’t “medical advice” and I’m supposed to validate everything it tells me, which takes way more time than it’s supposed to be saving me. So, who’s helping who in this scenario?
Sam
AI clinical decision support is entering medicine incredibly fast. Some of it is genuinely useful. It can summarize charts, organize differentials, surface literature, reduce documentation burden, and help clinicians process information faster. Most physicians I know have at least experimented with it, even if they’re not talking about it publicly yet. But there’s something unsettling about this moment in medicine.
Medicine has always tolerated uncertainty. What medicine has never been comfortable with is invisible uncertainty. That’s a very different problem.
The issue isn’t that AI makes mistakes. Every system in medicine makes mistakes. The issue is that clinicians often can’t recognize the mistakes AI makes.
A hallucinated citation can look authentic. A fabricated recommendation can sound clinically reasonable. An incorrect differential may still appear thoughtful, organized, and comprehensive. The output arrives wrapped in confidence: fluent language, structured reasoning, professional tone, well-formatted references.
Medicine trains us to associate those signals with competence. LLMs reproduce those signals extremely well. Sometimes uncomfortably well.
The Verification Paradox
AI clinical decision support is mostly marketed as a time-saving tool. That creates a paradox.
If you independently verify every recommendation, citation, interpretation, and reasoning step, much of the time savings disappears.
If you don’t verify the output, the system quietly becomes an unsupervised cognitive authority.
Neither option feels particularly great.
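The arithmetic behind that trade-off is simple enough to sketch. The Python snippet below is purely illustrative: the function name `net_minutes_saved` and every number in it are assumptions I made up for the example, not measurements from any study or product.

```python
def net_minutes_saved(task_minutes: float, ai_minutes: float, verify_minutes: float) -> float:
    """Minutes saved per task compared with doing it by hand.

    A negative result means the AI-assisted path took longer overall.
    """
    return task_minutes - (ai_minutes + verify_minutes)


# Illustrative numbers only (hypothetical, not measured).
unverified = net_minutes_saved(task_minutes=20, ai_minutes=2, verify_minutes=0)
spot_checked = net_minutes_saved(task_minutes=20, ai_minutes=2, verify_minutes=8)
fully_verified = net_minutes_saved(task_minutes=20, ai_minutes=2, verify_minutes=22)

print(unverified, spot_checked, fully_verified)  # 18 10 -4
```

With these invented inputs, skipping verification looks like a large win, spot-checking keeps some of it, and full verification costs more time than doing the task by hand. Where the break-even point sits in a real workflow is an empirical question this sketch cannot answer.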
Imagine using a lab assay with no established sensitivity or specificity. Imagine reading radiology reports from an unknown source with no validation data. Imagine prescribing a medication whose adverse effects could only be discovered through extensive independent investigation after each use.
Medicine would never tolerate that. Yet many clinicians are now using AI systems whose real-world clinical error rates remain largely undefined.
Invisible Error Is Different
There’s a major difference between errors that announce themselves and errors that blend into plausibility.
A potassium of 12 in an asymptomatic patient triggers immediate skepticism. A CT report describing a patient with one kidney when you know they have two is easy to catch. Generative AI errors are different because they’re often plausible.
The answer frequently contains:
appropriate terminology
convincing structure
partial truths
accurate references mixed with inaccurate ones
recommendations that sound completely reasonable in context
The system often performs well enough to earn trust before it fails. That creates a cognitive hazard medicine hasn’t really dealt with before.
Medicine Expects Known Limitations
Medicine doesn’t expect perfection from its tools.
Medicine expects known limitations.
Lab tests publish sensitivity and specificity.
Imaging studies have established false positive and false negative rates.
Clinical prediction rules undergo validation studies.
Medications have adverse event reporting systems.
Medical devices require performance standards.
Human trainees are supervised, reviewed, and audited.
These systems fail. We know they fail. We also understand roughly how, when, and why they fail.
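To make “known limitations” concrete, here is a minimal Python sketch of how a validated test’s performance is usually summarized from a 2x2 table. The class name `ValidationCounts` and the counts are hypothetical, chosen only for illustration; the formulas are the standard definitions of sensitivity, specificity, and predictive values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationCounts:
    """Counts from comparing a test against a reference standard."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int

    def sensitivity(self) -> float:
        # Of everyone who has the condition, the fraction the test catches.
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self) -> float:
        # Of everyone without the condition, the fraction the test clears.
        return self.true_neg / (self.true_neg + self.false_pos)

    def ppv(self) -> float:
        # Of all positive results, the fraction that are correct.
        return self.true_pos / (self.true_pos + self.false_pos)

    def npv(self) -> float:
        # Of all negative results, the fraction that are correct.
        return self.true_neg / (self.true_neg + self.false_neg)


# Hypothetical validation cohort: 100 with the condition, 900 without.
counts = ValidationCounts(true_pos=90, false_pos=45, true_neg=855, false_neg=10)

for name in ("sensitivity", "specificity", "ppv", "npv"):
    print(name, round(getattr(counts, name)(), 3))
```

The numbers themselves don’t matter here. What matters is that every one of these metrics has a denominator: a defined population, tested against a reference standard, with the failures counted.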
Generative AI clinical decision support occupies a very different position. Current systems still lack the safeguards medicine usually expects, even as their output looks finished:
No clinically meaningful denominator for error.
No stable performance characteristics across contexts.
No routine post-market surveillance.
No reliable way for frontline clinicians to recognize failure.
Outputs that appear complete, organized, and authoritative.
That combination is unusual in medicine.
An imperfect tool with visible limitations can often be used safely.
An imperfect tool with unclear limitations can be dangerous.
The Human Factors Problem
There’s another issue medicine hasn’t fully grappled with yet.
Humans are extremely susceptible to automation bias. When a system sounds confident, organized, and data-rich, people naturally defer to it, especially when they’re tired, overloaded, distracted, or under time pressure. Modern clinical medicine contains all four. Even experienced physicians are vulnerable to this.
Aviation learned this decades ago. Pilots have at times deferred to automated systems even when contradictory evidence was sitting right in front of them.
Medicine is now entering similar territory, except these systems are conversational. That matters. A conversational interface feels collaborative. Intelligent. Thoughtful. The psychological effect is powerful.
This Doesn’t Mean AI Is Useless
None of this means generative AI should be excluded from medicine. The technology is already useful in a lot of settings, and it’s almost certainly going to become more deeply integrated into clinical workflows.
But we should ask whether we’re applying the same evidentiary and safety standards to AI-assisted reasoning that we apply to everything else in clinical care.
Right now, we apply those standards inconsistently. Medicine has historically treated invisible error as dangerous. Generative AI introduces a form of invisible error that is persuasive, scalable, and difficult to recognize in real time.
That deserves a lot more attention than it’s currently getting.


