
Report: Even as larger AI models improve, answering more questions leads to more wrong answers

A recent study published in Nature found that newer, bigger versions of three major artificial intelligence chatbots are more likely to generate wrong answers than to admit they don’t know.

Although bigger, more refined large language models, which use more data and more complex reasoning and fine-tuning, proved better at giving accurate responses, they also exhibited another problem: they attempted to answer more questions overall.

“They are answering almost everything these days. And that means more correct, but also more incorrect answers,” said José Hernández-Orallo of the Valencian Research Institute for Artificial Intelligence in Spain.

The study also found that people who use chatbots aren’t very good at spotting bad answers, in part because the chatbot produces answers that look truthful. Hernández-Orallo added that as a result, users often overestimate the capabilities of chatbots, which is problematic.

The act of an LLM producing an answer that looks truthful but isn’t has an amusing term: “bullshit.” It was proposed by Mike Hicks, a philosopher of science and technology at the University of Glasgow in the U.K.

“That looks to me like what we would call bullshitting,” said Hicks. “It’s getting better at pretending to be knowledgeable.”

He suggested this term as an alternative to the industry-standard “hallucination,” in which an LLM produces a confident but completely incorrect answer. Although these errors can represent between 3% and 10% of responses to queries, there are ways to mitigate them, such as adding guardrails that ground specialized LLMs in more accurate information. That is harder with generalized AI models trained on vast datasets, and the problem can be even more prevalent when training data comes from the web, which can include AI-generated sources that lead to still more hallucinations.
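To make the guardrail idea concrete, here is a minimal, self-contained Python sketch of grounding with abstention. The toy retriever and knowledge base are illustrative stand-ins, not any real library’s API; the point is only that a system can decline to answer when it finds no supporting reference material.

```python
# Illustrative sketch only: answer from reference material when it exists,
# otherwise abstain. The retriever and "model" below are toy stand-ins.

def retrieve(question: str, knowledge_base: dict[str, str]) -> str | None:
    """Toy retriever: return the stored fact whose topic words overlap the question."""
    q_words = set(question.lower().split())
    for topic, fact in knowledge_base.items():
        if set(topic.lower().split()) & q_words:
            return fact
    return None

def grounded_answer(question: str, knowledge_base: dict[str, str]) -> str:
    context = retrieve(question, knowledge_base)
    if context is None:
        # Declining beats a confident fabrication (a "hallucination").
        return "I don't know."
    # In a real system, this context would constrain an LLM's response.
    return f"Based on reference material: {context}"

kb = {"capital France": "Paris is the capital of France."}
print(grounded_answer("What is the capital of France?", kb))  # grounded answer
print(grounded_answer("Who won the 2031 World Cup?", kb))     # abstains
```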

The research team examined three LLM families: OpenAI’s GPT, Meta Platforms Inc.’s Llama and BigScience’s open-source BLOOM. The researchers ran thousands of prompts covering arithmetic, anagrams, geography and science, along with questions testing the models’ ability to transform information.

Accuracy increased as models became larger and decreased as questions became harder. The researchers had hoped the models would avoid answering questions that were too difficult; instead, models such as GPT-4 answered almost everything.
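For illustration, the bookkeeping behind such a finding can be sketched in a few lines of Python: bucket each response as correct, incorrect or avoidant, then compare rates across difficulty levels. The numbers below are invented for the example, not results from the paper.

```python
# Illustrative tally of model responses by difficulty. The sample data is
# made up to show the pattern: few avoidant answers even on hard questions.
from collections import Counter

def tally(responses: list[tuple[str, str]]) -> dict[str, Counter]:
    """responses: (difficulty, outcome) pairs, outcome in {correct, incorrect, avoidant}."""
    buckets: dict[str, Counter] = {}
    for difficulty, outcome in responses:
        buckets.setdefault(difficulty, Counter())[outcome] += 1
    return buckets

sample = ([("easy", "correct")] * 80 + [("easy", "incorrect")] * 15 +
          [("easy", "avoidant")] * 5 + [("hard", "correct")] * 35 +
          [("hard", "incorrect")] * 60 + [("hard", "avoidant")] * 5)

for difficulty, counts in tally(sample).items():
    total = sum(counts.values())
    print(difficulty, {k: f"{v / total:.0%}" for k, v in counts.items()})
```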

Equally at issue, people asked to classify answers as correct, incorrect or avoidant misjudged inaccurate answers as accurate surprisingly often: roughly 10% of the time for easy questions and roughly 40% for difficult ones.

To deal with the issue, Hernández-Orallo said, developers need to tune models to reduce hallucinations on easy questions and to simply decline to answer hard ones. That may be what’s needed for people to develop a better sense of where an AI model can be trusted to be consistent and accurate.
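In code terms, the proposed behavior amounts to a refusal threshold. Below is a deliberately tiny Python sketch of that idea; the confidence score is a hypothetical stand-in for whatever estimate a real deployment might expose (self-reported confidence, a calibrated verifier, and so on), not anything specified in the study.

```python
# Illustrative refusal threshold: decline rather than guess when confidence
# is low. The confidence value is a hypothetical input, not a real model API.

def answer_or_decline(answer: str, confidence: float, threshold: float = 0.8) -> str:
    return answer if confidence >= threshold else "I don't know."

print(answer_or_decline("2 + 2 = 4", confidence=0.99))       # easy: answers
print(answer_or_decline("a risky guess", confidence=0.35))   # hard: declines
```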

“We need humans to understand: ‘I can use it in this area, and I shouldn’t use it in that area’,” Hernández-Orallo said.

Source: siliconangle.com
