OpenAI rolls out 'reasoning' o1 model family

OpenAI on Thursday introduced o1, its latest large language model family, which it claims is capable of emulating complex reasoning.

The o1 model set – which presently consists of o1-preview and o1-mini – employs "chain of thought" techniques.

In a 2022 paper, Google researchers described chain of thought as "a series of intermediate natural language reasoning steps that lead to the final output."

OpenAI has explained the technique as meaning o1 "learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason."

To understand the chain of thought technique, consider the following prompt, an arithmetic word problem from the Google paper:

The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

According to the paper, GPT-3 could not reliably produce an accurate answer to that prompt.

The current free version of ChatGPT – powered by OpenAI's GPT-4o mini model – already has some power to emulate "reasoning": it responds to the prompt by showing its working – 23 minus 20 leaves three apples, and buying six more makes nine – and arrives at a pleasingly detailed and correct answer.
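
To make the technique concrete, here's a minimal sketch in Python of the few-shot chain of thought prompting the Google paper describes. It is purely illustrative – no model is called, and the exemplar is the paper's tennis-ball problem – it just shows how a worked example is prepended to a new question so the model imitates the step-by-step style:

    # Minimal sketch of few-shot chain-of-thought prompting (Wei et al., 2022).
    # A worked exemplar teaches the model to answer step by step rather than
    # guessing a bare number. No API is called; this just builds the prompt.
    COT_EXEMPLAR = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
    )

    def build_cot_prompt(question: str) -> str:
        """Prepend the worked exemplar so the model continues in kind."""
        return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

    print(build_cot_prompt(
        "The cafeteria had 23 apples. If they used 20 to make lunch "
        "and bought 6 more, how many apples do they have?"
    ))
    # A chain-of-thought completion should read something like:
    # "They started with 23, used 20, leaving 3; bought 6 more, so
    #  3 + 6 = 9. The answer is 9."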

OpenAI's explainer for o1 and its chain of thought tech offers examples, including the model being asked to solve a crossword puzzle after being given a textual representation of the puzzle grid and clues.

GPT-4o can't solve the puzzle.

o1-preview solves the puzzle and explains how it did it: its output first analyzes the puzzle itself, then walks through how it went about filling in the answers. That step-by-step narration is chain of thought at work.

OpenAI likes that output for two reasons.

One is that "Chain of thought reasoning provides new opportunities for alignment and safety," according to the explainer article. "We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles."

"We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios."

The other is that o1 smashes its predecessors on OpenAI's own benchmarks – which can't be bad for business.

Your mileage may vary.

Under the hood

"o1 is trained with RL [reinforcement learning] to 'think' before responding via a private chain of thought," explained Noam Brown, research scientist at OpenAI, in a social media thread. "The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We're no longer bottlenecked by pretraining. We can now scale inference compute too."

What's new for OpenAI here is that adding computational resources to the inference phase – referred to as "test-time compute" – improves results. That's good news for Nvidia and cloud AI providers who want to sell resources.
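
OpenAI hasn't said how o1 spends that extra compute, but a published analogue illustrates the scaling effect: self-consistency, which samples several chains of thought and takes a majority vote over their final answers, so more samples – that is, more inference compute – buy more accuracy. A minimal sketch, with a random stand-in playing the part of the model:

    import random
    from collections import Counter

    def sample_answer(p_correct: float = 0.6) -> int:
        """Stand-in for one sampled chain of thought: returns the right
        answer (9) with probability p_correct, else a plausible slip."""
        return 9 if random.random() < p_correct else random.choice([3, 27, 29])

    def self_consistency(n_samples: int) -> int:
        """Majority vote over n independently sampled answers."""
        votes = Counter(sample_answer() for _ in range(n_samples))
        return votes.most_common(1)[0][0]

    random.seed(0)
    for n in (1, 5, 25, 125):
        hits = sum(self_consistency(n) == 9 for _ in range(1000))
        print(f"{n:>3} samples per question -> {hits / 10:.1f}% correct")
    # Accuracy climbs toward 100% as n grows: the same model, better
    # answers, bought purely with extra inference compute.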

It’s unclear what it will cost to use the model. OpenAI does not disclose how much test-time compute was required to approach the 80 percent accuracy figure cited in its "o1 AIME [American Invitational Mathematics Examination] accuracy at test time" graph. It could be a significant amount.

Brown claims that o1 can take a few seconds to refine its answer – that's already a potential showstopper for some applications. But he adds that OpenAI foresees its models calculating away for hours, days, or even weeks. "Inference costs will be higher, but what cost would you pay for a new cancer drug?" he asked. "For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots."

The answer to the cost question may be: "How much do you have?"

The reasonableness of “reasoning”

OpenAI’s docs call its new offerings “reasoning models”.

We asked Daniel Kang, assistant professor in the computer science department at the University of Illinois Urbana-Champaign, if that’s a reasonable description.

"'Reasoning' is a semantic thing in my opinion," Kang told The Register. "They are doing test-time scaling, which is roughly similar to what AlphaGo does. I don't know how to adjudicate semantic arguments, but I would anticipate that most people would consider this reasoning."

Citing Brown's remarks, Kang said OpenAI's reinforcement learning approach resembles that used by AlphaGo, which involves trying multiple paths with a reward function to determine which path is the best.
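
That recipe – generate several candidate reasoning paths, score them with a reward function, keep the winner – can be sketched in a few lines. The scorer below is a toy stand-in for a trained reward model, not anything OpenAI has published:

    # Best-of-n over candidate reasoning paths, scored by a (toy) reward.
    def best_of_n(candidates: list[str], reward) -> str:
        """Keep whichever candidate path the reward function scores highest."""
        return max(candidates, key=reward)

    def toy_reward(path: str) -> float:
        """Toy scorer: rewards paths that show their work and reach 9."""
        score = 0.0
        if "23 - 20" in path:
            score += 1.0               # shows the intermediate subtraction
        if path.rstrip().endswith("= 9"):
            score += 1.0               # reaches 9 as the final result
        return score

    paths = [
        "23 + 6 = 29",                   # skipped a step, wrong answer
        "23 - 20 = 3, then 3 + 6 = 9",   # full chain, correct
        "The answer is 27",              # no visible reasoning
    ]
    print(best_of_n(paths, toy_reward))  # -> "23 - 20 = 3, then 3 + 6 = 9"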

Alon Yamin, co-founder and CEO of AI-based text analytics biz Copyleaks, told The Register that o1 represents an approximation of how our brains process complex problems.

"Using these terms is fair to a point, as long as we don't forget that these are analogies and not literal descriptions of what the LLMs are doing," he stressed.

"While it may not fully replicate human reasoning in its entirety, chain of thought enables these models to tackle more complex problems in a way that 'starts' to resemble how we process complex information or challenges as humans. No matter the semantics, this release is still a real milestone; it's more than just about LLM solving problems better; it's the first real sign that AI is moving toward something more advanced. And for those of us working in this space, that is exciting because it shows the tech's potential to evolve into a tool that works alongside us rather than for us."

Overthinking it?

Brown cautions that o1 is not always better than GPT-4o. "Many tasks don't need reasoning, and sometimes it's not worth it to wait for an o1 response vs a quick GPT-4o response," he explains. "One motivation for releasing o1-preview is to see what use cases become popular, and where the models need work."

OpenAI asserts that its new model does far better at coding than its predecessors. GitHub – a subsidiary of Microsoft, which has invested heavily in OpenAI – says it has seen improvements when the o1 model is used with its code assistant, Copilot. The o1-preview model proved more adept at optimizing the performance of a byte pair encoder in Copilot Chat's tokenizer library, and it found and fixed a bug in minutes, compared to the hours GPT-4o needed. Access to o1-preview and o1-mini in GitHub Copilot currently requires signing up for Azure AI.
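
For context, a byte pair encoder builds a tokenizer's vocabulary by repeatedly merging the most frequent adjacent pair of symbols – the sort of hot loop where such optimizations pay off. The single merge step below is a minimal illustration, not GitHub's actual code:

    from collections import Counter

    def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
        """Count adjacent token pairs and return the most common one."""
        return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

    def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
        """Replace each occurrence of `pair` with one merged token."""
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    tokens = list("banana")            # ['b','a','n','a','n','a']
    pair = most_frequent_pair(tokens)  # ('a','n') and ('n','a') tie here
    print(merge_pair(tokens, pair))    # e.g. ['b', 'an', 'an', 'a']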

Is it dangerous?

OpenAI's o1 System Card designates the model "Medium" risk for "Persuasion" and "CBRN" (chemical, biological, radiological, and nuclear) using its Preparedness Framework scorecard. GPT-4o also scored "Medium" in the "Persuasion" category, but "Low" for CBRN.

The System Card's Natural Sciences Red Teaming Assessment Summary notes that while o1-preview and o1-mini can help experts operationalize plans to reproduce known biological threats (qualifying as "Medium" risk), they don't provide novices with the ability to do so. Hence the models' "inconsistent refusal of requests to synthesize nerve agents" – which could also be written "occasional willingness" – "does not pose significant risk." ®

Source: theregister.com
