
Sophisticated AI models are more likely to lie

Human feedback training may incentivize providing any answer—even wrong ones.


When a research team led by Amrit Kirpalani, a medical educator at Western University in Ontario, Canada, evaluated ChatGPT’s performance in diagnosing medical cases back in August 2024, one of the things that surprised them was the AI’s propensity to give well-structured, eloquent but blatantly wrong answers.

Now, in a study recently published in Nature, a different group of researchers tried to explain why ChatGPT and other large language models tend to do this. “To speak confidently about things we do not know is a problem of humanity in a lot of ways. And large language models are imitations of humans,” says Wout Schellaert, an AI researcher at the University of Valencia, Spain, and co-author of the paper.

Smooth Operators

Early large language models like GPT-3 had a hard time answering simple questions about geography or science. They even struggled with simple math such as “how much is 20 + 183.” But in most cases where they couldn’t identify the correct answer, they did what an honest human being would do: They avoided answering the question.

The problem with these non-answers was that large language models were intended to be question-answering machines. For commercial companies like OpenAI or Meta that were developing advanced LLMs, a question-answering machine that answered “I don’t know” more than half the time was simply a bad product. So, they got busy solving this problem.

The first thing they did was scale the models up. “Scaling up refers to two aspects of model development. One is increasing the size of the training data set, usually a collection of text from websites and books. The other is increasing the number of language parameters,” says Schellaert. When you think about an LLM as a neural network, the number of parameters can be compared to the number of synapses connecting its neurons. LLMs like GPT-3 used absurd amounts of text data, exceeding 45 terabytes, for training. The number of parameters used by GPT-3 was north of 175 billion.
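To put those parameter counts in perspective, here is a back-of-the-envelope sketch (my own simplification, not anything from the paper) that estimates a GPT-3-scale parameter count from its publicly reported configuration of 96 layers and a model width of 12,288, using the standard 12 × layers × width² approximation and ignoring embeddings and biases.

```python
# Rough, simplified estimate of transformer parameter counts.
# Assumes the common "12 * n_layers * d_model^2" approximation for the
# attention and feed-forward weights; embeddings and biases are ignored.

def approx_params(n_layers: int, d_model: int) -> int:
    attention = 4 * d_model * d_model      # Q, K, V, and output projections
    feed_forward = 8 * d_model * d_model   # two linear layers with a 4x hidden width
    return n_layers * (attention + feed_forward)

# Publicly reported GPT-3 configuration: 96 layers, model width 12,288.
print(f"{approx_params(96, 12_288) / 1e9:.0f}B parameters")  # ~174B
```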

But it was not enough.

Scaling up alone made the models more powerful, but they were still bad at interacting with humans—slight variations in how you phrased your prompts could lead to drastically different results. The answers often didn’t feel human-like and sometimes were downright offensive.

Developers working on LLMs wanted them to parse human questions better and to make their answers more accurate, more comprehensible, and consistent with generally accepted ethical standards. To get there, they added an extra training step: supervised learning methods such as reinforcement learning from human feedback. This was meant primarily to reduce sensitivity to prompt variations and to provide a level of output-filtering moderation intended to curb hate-spewing, Tay-chatbot-style answers.

In other words, we got busy adjusting the AIs by hand. And it backfired.

AI people pleasers

“The notorious problem with reinforcement learning is that an AI optimizes to maximize reward, but not necessarily in a good way,” Schellaert says. Some of the reinforcement learning involved human supervisors who flagged answers they were not happy with. Since it’s hard for humans to be happy with “I don’t know” as an answer, one thing this training told the AIs was that saying “I don’t know” was a bad thing. So, the AIs mostly stopped doing that. But another, more important thing human supervisors flagged was incorrect answers. And that’s where things got a bit more complicated.

AI models are not really intelligent, not in a human sense of the word. They don’t know why something is rewarded and something else is flagged; all they are doing is optimizing their performance to maximize reward and minimize red flags. When incorrect answers were flagged, getting better at giving correct answers was one way to optimize things. The problem was that getting better at hiding incompetence worked just as well. Human supervisors simply didn’t flag wrong answers that appeared good and coherent enough to them.

In other words, if a human didn’t know whether an answer was correct, they wouldn’t be able to penalize wrong but convincing-sounding answers.
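A toy simulation, purely illustrative and not part of the Nature study, makes the incentive concrete: if graders reward anything that looks convincing and catch only a minority of the wrong answers, a policy that swaps “I don’t know” for confident guesses collects more reward than an honest one.

```python
import random

random.seed(0)

def grade(answer_correct, answer_confident, grader_catches=0.3):
    """Toy grader: dislikes non-answers, flags wrong answers only when it can tell."""
    if not answer_confident:   # an honest "I don't know"
        return 0               # no reward: graders are unhappy with non-answers
    if answer_correct:
        return 1
    # Wrong but confident: penalized only if this grader happens to know better.
    return -1 if random.random() < grader_catches else 1

def average_reward(always_answer, trials=10_000, model_knows_rate=0.6):
    total = 0
    for _ in range(trials):
        model_knows = random.random() < model_knows_rate
        if model_knows:
            total += grade(True, True)
        else:
            # The honest policy abstains; the people-pleasing policy guesses confidently.
            total += grade(False, always_answer)
    return total / trials

print("honest policy reward:      ", average_reward(always_answer=False))  # ~0.60
print("confident-guessing reward: ", average_reward(always_answer=True))   # ~0.76
```

Under these made-up assumptions the confident-guessing policy earns more reward, which is roughly the dynamic Schellaert describes.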

Schellaert’s team looked into three major families of modern LLMs: OpenAI’s ChatGPT, the LLaMA series developed by Meta, and the BLOOM suite made by BigScience. They found what’s called ultracrepidarianism, the tendency to give opinions on matters we know nothing about. It started to appear in the AIs as a consequence of increasing scale, and in all of them it grew predictably and linearly with the amount of training data. Supervised feedback “had a worse, more extreme effect,” Schellaert says. The first model in the GPT family that almost completely stopped avoiding questions it didn’t have the answers to was text-davinci-003. It was also the first GPT model trained with reinforcement learning from human feedback.

The AIs lie because we told them that doing so was rewarding. One key question is when, and how often, we get lied to.

Making it harder

To answer this question, Schellaert and his colleagues built a set of questions in different categories like science, geography, and math. Then, they rated those questions based on how difficult they were for humans to answer, using a scale from 1 to 100. The questions were then fed into subsequent generations of LLMs, starting from the oldest to the newest. The AIs' answers were classified as correct, incorrect, or evasive, meaning the AI refused to answer.
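For a feel of the bookkeeping involved, the sketch below tallies hypothetical graded records into per-difficulty rates; the labels mirror the classification described above, but the records and bin width are assumptions, not the paper’s data.

```python
from collections import Counter

# Hypothetical graded records: (human difficulty 1-100, outcome) pairs.
records = [
    (12, "correct"), (35, "correct"), (48, "incorrect"),
    (62, "correct"), (71, "incorrect"), (83, "incorrect"), (90, "evasive"),
]

def rates_by_bin(records, bin_width=25):
    """Tally correct / incorrect / evasive shares per difficulty bin."""
    bins = {}
    for difficulty, outcome in records:
        bin_start = (difficulty - 1) // bin_width * bin_width + 1
        bins.setdefault(bin_start, Counter())[outcome] += 1
    for start in sorted(bins):
        counts = bins[start]
        total = sum(counts.values())
        shares = {label: f"{n / total:.0%}" for label, n in counts.items()}
        print(f"difficulty {start}-{start + bin_width - 1}: {shares}")

rates_by_bin(records)
```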

The first finding was that the questions that appeared more difficult to us also proved more difficult for the AIs. The latest versions of ChatGPT gave correct answers to nearly all science-related prompts and the majority of geography-oriented questions up until they were rated roughly 70 on Schellaert’s difficulty scale. Addition was more problematic, with the frequency of correct answers falling dramatically after the difficulty rose above 40. “Even for the best models, the GPTs, the failure rate on the most difficult addition questions is over 90 percent. Ideally we would hope to see some avoidance here, right?” says Schellaert. But we didn’t see much avoidance.

Instead, in more recent versions of the AIs, the evasive “I don’t know” responses were increasingly replaced with incorrect ones. And thanks to the supervised training used in later generations, the AIs developed the ability to sell those incorrect answers quite convincingly. Of the three LLM families Schellaert’s team tested, BLOOM and Meta’s LLaMA were released in versions both with and without supervised learning. In both cases, supervised learning resulted in a higher number of correct answers, but also in a higher number of incorrect answers and reduced avoidance. The more difficult the question and the more advanced the model you use, the more likely you are to get well-packaged, plausible nonsense as your answer.

Back to the roots

One of the last things Schellaert’s team did in their study was check how likely people were to take the incorrect AI answers at face value. They did an online survey and asked 300 participants to evaluate multiple prompt-response pairs coming from the best performing models in each family they tested.

ChatGPT emerged as the most effective liar. The incorrect answers it gave in the science category were judged correct by over 19 percent of participants. It managed to fool nearly 32 percent of people in geography and over 40 percent in transforms, a task where an AI had to extract and rearrange information present in the prompt. ChatGPT was followed by Meta’s LLaMA and BLOOM.

"In the early days of LLMs, we had at least a makeshift solution to this problem. The early GPT interfaces highlighted parts of their responses that the AI wasn’t certain about. But in the race to commercialization, that feature was dropped, said Schellaert.

"There is an inherent uncertainty present in LLMs’ answers. The most likely next word in the sequence is never 100 percent likely. This uncertainty could be used in the interface and communicated to the user properly,” says Schellaert. Another thing he thinks can be done to make LLMs less deceptive is handing their responses over to separate AIs trained specifically to search for deceptions. “I’m not an expert in designing LLMs, so I can only speculate what exactly is technically and commercially viable,” he adds.

It’s going to take some time, though, before the companies developing general-purpose AIs do something about it, either of their own accord or because future regulations force them to. In the meantime, Schellaert has some suggestions for how to use them effectively. “What you can do today is use AI in areas where you are an expert yourself or at least can verify the answer with a Google search afterwards. Treat it as a helping tool, not as a mentor. It’s not going to be a teacher that proactively shows you where you went wrong. Quite the opposite. When you nudge it enough, it will happily go along with your faulty reasoning,” Schellaert says.

Nature, 2024. DOI: 10.1038/s41586-024-07930-y

Jacek Krywko is a freelance science and technology writer who covers space exploration, artificial intelligence research, computer science, and all sorts of engineering wizardry.
