
Google's latest models prove AI doesn't have to suck at math

Researchers at Google DeepMind claim they've developed a pair of AI models capable of taking home a silver medal in the International Mathematical Olympiad (IMO) – although not within the allotted time limit.

Dubbed AlphaProof and AlphaGeometry 2, these models are designed to overcome one of the bigger hurdles facing popular AI systems today: thanks to limitations in artificial reasoning and training data, they kind of suck at math.

To overcome this, DeepMind developed AlphaProof, which combines a language model with its AlphaZero reinforcement learning algorithm. You may recall AlphaZero is the same reinforcement learning system DeepMind used to master chess, shogi, and Go a few years back.

AlphaProof trains itself to prove mathematical statements in Lean, a functional programming language and proof assistant.

"Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified," the DeepMind team wrote in a recent blog.

This means that AlphaProof can not only provide an answer, but prove that it's correct. This differs from existing natural language models, which will confidently hallucinate a plausible-sounding answer without actually knowing whether it's correct.
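
To make that concrete, here's a minimal hand-written Lean 4 example (ours, not DeepMind's output) of what a formally verified proof looks like. If the proof were wrong, Lean's kernel would reject the file outright:

```lean
-- The kernel checks every step: a wrong or "hallucinated" proof
-- simply fails to compile, so a theorem that type-checks is correct.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```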

That's not to say NLP or LLMs can't be useful. DeepMind fine-tuned a Gemini LLM to translate natural language problem statements into ones the AlphaZero algorithm can interpret and use.
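
As an illustration of what that translation step produces – again a hand-written example, not Gemini's actual output – an informal statement and one possible Lean rendering might look like this:

```lean
-- Informal: "the sum of two even natural numbers is even."
theorem even_add_even (m n : Nat)
    (hm : m % 2 = 0) (hn : n % 2 = 0) : (m + n) % 2 = 0 := by
  rw [Nat.add_mod, hm, hn]
```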

"When presented with a problem, AlphaProof generates candidates and then proves or disproves them by searching other possible proof steps in Lean. Every time a solution is verified it's used to reinforce the model.

The DeepMind team did this for millions of problems in the lead-up to the IMO – in fact, because it's a reinforcement learning model, training continued throughout the competition.
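
DeepMind hasn't published code for this loop, but a toy, self-contained sketch of the idea – generate candidates, check each against a trusted verifier, reinforce on every verified success – might look like the following. Every name here is illustrative, and a stub divisor check stands in for Lean:

```python
# A toy sketch of the verify-and-reinforce loop described above. Nothing
# here is DeepMind's code or API: the "theorem" is just finding a proper
# divisor, and lean_verify() is a stub standing in for Lean's kernel.
import random

def lean_verify(problem: int, candidate: int) -> bool:
    # Stand-in for Lean: formally accept or reject a candidate solution.
    return 1 < candidate < problem and problem % candidate == 0

def solve_and_reinforce(problem: int, weights: dict[int, float],
                        budget: int = 1000) -> int | None:
    pool = list(range(2, problem))
    for _ in range(budget):
        # Sample a candidate, biased toward previously reinforced answers.
        probs = [weights.get(c, 1.0) for c in pool]
        candidate = random.choices(pool, weights=probs)[0]
        if lean_verify(problem, candidate):
            # Every verified solution becomes training signal.
            weights[candidate] = weights.get(candidate, 1.0) + 10.0
            return candidate
    return None

weights: dict[int, float] = {}
for p in (91, 91, 221):  # repeated, related problems benefit from reinforcement
    print(p, "->", solve_and_reinforce(p, weights))
```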

AlphaGeometry 2, on the other hand, functions in much the same way as its predecessor, detailed back in January. It combines a neural language model with a "rule-bound deduction engine" and the two work together to find proofs for problems.

As DeepMind explains it, the language model is used to identify patterns and suggest useful constructs, while the symbolic engine uses formal logic to arrive at solutions. The downside to this approach is that the second step is comparatively slow.
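
DeepMind hasn't released AlphaGeometry 2's deduction engine, but the division of labor can be sketched with a toy forward-chainer plus a stand-in for the model's suggestions. The single rule, the facts, and propose_construction() below are all illustrative assumptions, not AlphaGeometry's actual rule set or API:

```python
# Toy sketch: a rule-bound engine forward-chains over geometric facts,
# and a stand-in "language model" proposes an auxiliary construction
# whenever deduction stalls short of the goal.
Fact = tuple[str, str, str]  # e.g. ("parallel", "AB", "CD")

def forward_chain(facts: set[Fact]) -> set[Fact]:
    # One illustrative rule: parallelism is transitive.
    changed = True
    while changed:
        changed = False
        for (r1, a, b) in list(facts):
            for (r2, b2, c) in list(facts):
                if r1 == r2 == "parallel" and b == b2 and a != c:
                    if ("parallel", a, c) not in facts:
                        facts.add(("parallel", a, c))
                        changed = True
    return facts

def propose_construction(facts: set[Fact]) -> Fact:
    # Stand-in for the Gemini-based model: suggest a construct that
    # might unlock further deduction (hard-coded for this demo).
    return ("parallel", "CD", "EF")

facts: set[Fact] = {("parallel", "AB", "CD")}
goal: Fact = ("parallel", "AB", "EF")
facts = forward_chain(facts)
if goal not in facts:  # deduction stalled: ask the "model" for help
    facts.add(propose_construction(facts))
    facts = forward_chain(facts)
print(goal in facts)   # True
```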

For its second-gen AlphaGeometry model, DeepMind explained that the language model is based on Gemini and was trained on an "order of magnitude" more synthetic data than its predecessor. The symbolic engine has also been sped up considerably – reportedly by "two orders of magnitude."

Putting them to the test

The DeepMind team tasked AlphaProof and AlphaGeometry 2 with solving the six advanced mathematics problems faced by competitors in this year's IMO.

The competition – which dates back to 1959 and sees pre-college mathematicians tackle some of the hardest problems in algebra, combinatorics, geometry, and number theory – has become something of a proving ground for machine learning devs in recent years.

According to DeepMind, the two models solved four of the six problems between them – AlphaProof handled two algebra problems and one number theory problem, while AlphaGeometry 2 cracked this year's geometry problem. The two combinatorics questions, however, went unsolved.

Tallied up, DeepMind's models still did fairly well: with each problem worth seven points, their four complete solutions earned 28 of a possible 42 – the equivalent of a silver medal, and one point shy of the gold threshold.


However, there seems to be plenty of room for improvement. For this competition at least, DeepMind conceded it still had to manually translate the problems given to competitors into a formal mathematical language the models could understand.

The models also failed to solve most of the problems within the allotted time – two 4.5-hour sessions spread across two days. While the models solved one of the problems within minutes, the others took upwards of three days.

The DeepMind researchers are not done yet, of course. They report they are already experimenting with a natural language reasoning system, built on Gemini, that wouldn't require problems to be translated into a formal language and could be combined with other AI systems. That should speed things up a bit. ®

Source: theregister.com
