It’s no surprise to those of us following generative artificial intelligence news that AI is imperfect. In fact, generative AI so often spits out false or otherwise incorrect outputs that we have a name for them: hallucinations.

That’s part of the problem with outsourcing so much of our work to AI at this moment. AI can be used for good, but blindly trusting it to handle important tasks without oversight or fact-checking carries real risk. We’re now seeing the consequences play out in concerning ways.

OpenAI’s Whisper has a hallucination problem

The latest high-profile hallucination case concerns Whisper, an AI-powered transcription tool from ChatGPT-maker OpenAI. Whisper is popular: Transcription services frequently tap into the platform to power their tools, which, in turn, are used by many customers to make transcribing conversations quicker and easier. On the surface, that’s a good thing: Whisper and the services it enables have had a positive reputation among users, and the platform is growing in use across industries.

However, hallucination is getting in the way. As reported by AP News, researchers and experts are sounding the alarm about Whisper, claiming that not only is it inaccurate, but it often makes things up entirely. While all AI is prone to hallucinating, researchers warn that Whisper will insert things that were never said, including “racial commentary, violent rhetoric and even imagined medical treatments.”

That’s bad enough for those of us who use Whisper for personal projects. But the larger concern here is that Whisper has a large base of users in professional industries: Subtitles you see when watching a video online may be generated by Whisper, which could affect the impression that video gives viewers who are deaf or hard of hearing. Important interviews may be transcribed using Whisper-powered tools, which may leave incorrect records of what was actually said.

Your conversations with your doctors may be transcribed inaccurately

However, the situation garnering the most attention right now is Whisper’s use within hospitals and medical centers. Researchers are concerned by the number of doctors and medical professionals who have turned to Whisper tools to transcribe their conversations with patients. Your discussion about your health with your doctor may be recorded, then analyzed by Whisper, only to be transcribed with totally false statements that were never a part of the conversation.

This isn’t hypothetical, either: Different researchers have each reached similar conclusions by studying the transcriptions of Whisper-powered tools. AP News rounded up some of these results: A University of Michigan researcher discovered hallucinations in eight out of 10 transcriptions made by Whisper; a machine learning engineer found issues with 50% of the transcriptions he investigated; and one researcher found hallucinations in almost all of the 26,000 Whisper transcriptions they produced. A study even found consistent hallucinations when the audio recordings were short and clear.

But it’s the reporting from professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia that offers the most visceral look at the situation: These professors determined that nearly 40% of the hallucinations they found in transcripts taken from Carnegie Mellon research repository TalkBank were “harmful or concerning,” as the speaker could be “misinterpreted or misrepresented.”

In one example, the speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” The AI added the following to the transcription: “He took a big piece of a cross, a teeny, small piece…I’m sure he didn’t have a terror knife so he killed a number of people.” In another example, the speaker said, “two other girls and one lady,” while the AI turned it into, “two other girls and one lady, um, which were Black.”

When you take all this into consideration, it seems concerning that over 30,000 clinicians and 40 health systems are currently using Whisper via a tool developed by Nabla. What’s worse, you cannot check the transcriptions against the original recordings to identify whether Nabla’s tool hallucinated part of the report, as Nabla designed the tool to delete the audio for “data safety reasons.” According to the company, around seven million medical visits have used the tool to transcribe conversations.

Is AI really ready for prime time?

Generative AI as a technology isn’t new, but ChatGPT really kicked off its general adoption in late 2022. Since then, companies have raced to build and add AI into their platforms and services. Why wouldn’t they? It seemed like the public really liked AI, and, well, generative AI seemed like it could do just about anything. Why not embrace it, and use the “magic” of AI to superpower tasks like transcriptions?

We’re seeing why at this moment. AI has a lot of potential, but also plenty of downsides. Hallucinations aren’t just an occasional annoyance: They’re a byproduct of the technology, a flaw built into the fabric of neural networks. We don’t totally understand why AI models hallucinate, and that’s part of the problem. We’re trusting technology with flaws we don’t fully understand to handle important work for us, so much so that we’re deleting the data that could be used to double-check AI’s outputs in the name of safety.

Personally, I don’t feel safe knowing my medical records could contain outright falsehoods, just because my doctor’s office decided to employ Nabla’s tools in their system.