Google LLC has developed a series of language models that can answer questions about numerical facts more accurately than earlier algorithms.
The DataGemma series, as the model lineup is called, debuted on Thursday. Google has made the models available on Hugging Face.
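For developers, pulling the checkpoints down from Hugging Face follows the usual transformers workflow. Below is a minimal sketch; the repository ID is an assumption (check the Hugging Face listing for the exact names), and a 27-billion-parameter model is assumed to have enough GPU memory available.

```python
# Minimal sketch: loading a DataGemma checkpoint from Hugging Face.
# The repo ID below is an assumption; consult the Hugging Face listing for the exact names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rag-27b-it"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision; a large GPU is assumed for 27B parameters
    device_map="auto",
)

prompt = "What share of US households lacked internet access in 2022?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```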
DataGemma is optimized to field user questions about statistical facts such as the average revenue of companies in a given market segment. It answers queries using information from Data Commons, a free knowledge repository maintained by Google. The repository contains more than 240 billion data points from sources such as the United Nations, the World Health Organization, the CDC and census bureaus.
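Data Commons also exposes its statistics programmatically, independently of DataGemma. As a rough illustration of the kind of lookup involved, here is a sketch assuming the datacommons Python client; the place and variable identifiers are examples chosen for illustration, not drawn from the article.

```python
# Rough illustration of a direct Data Commons lookup, independent of DataGemma.
# Assumes the `datacommons` Python client; the place and statistical-variable IDs
# below are illustrative examples.
import datacommons as dc

# "geoId/06" is California; "Count_Person" is the total-population statistical variable.
population = dc.get_stat_value("geoId/06", "Count_Person")
print(f"Latest recorded population of California: {population}")
```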
Under the hood, the DataGemma series is based on Gemma 2 27B, an open-source large language model that Google released in June. Gemma 2 27B is powered by the industry-standard Transformer neural network architecture and features 27 billion parameters. Google says that it can rival the performance of LLMs with twice as many parameters.
According to the company, DataGemma is based on a version of Gemma 2 27B that was specifically optimized to process numerical facts. The model interacts with Data Commons, the knowledge repository from which it retrieves those facts, through a natural language interface.
“Instead of needing knowledge of the specific data schema or API of the underlying datasets, DataGemma utilizes the natural language interface of Data Commons to ask questions,” Google software engineer Jennifer Chen and Prem Ramaswami, the head of Data Commons, detailed in a blog post. “The nuance is in training the LLM to know when to ask.”
Google developed two versions of DataGemma as part of the project. Each takes a different approach to answering user questions.
The first version leverages a method known as RIG, or retrieval-interleaved generation, to process queries. When a user asks a question, the model doesn’t generate an answer based on its internal knowledge base but rather asks Data Commons for the needed information. The LLM then uses the retrieved data to generate a response to the prompt.
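In outline, retrieval-interleaved generation means the fine-tuned model emits a natural language query for Data Commons at the point in its draft where a statistic is needed, and the retrieved value is spliced into the final answer. The sketch below illustrates that loop; the [DC: ...] query marker and the placeholder retrieval function are assumptions made for illustration, not DataGemma's actual format.

```python
# Illustrative sketch of a retrieval-interleaved generation loop.
# The [DC: ...] marker and query_data_commons() are hypothetical stand-ins.
import re

def query_data_commons(question: str) -> str:
    """Placeholder for a natural language lookup against Data Commons."""
    # A real implementation would send the question to Data Commons and return the statistic.
    return "39.2 million (2021)"

def answer_with_rig(model_output: str) -> str:
    """Replace interleaved [DC: ...] queries in the model's draft with retrieved values."""
    return re.sub(
        r"\[DC:\s*(.*?)\]",
        lambda m: query_data_commons(m.group(1)),
        model_output,
    )

draft = "California's population is [DC: what is the population of California?]."
print(answer_with_rig(draft))
# -> "California's population is 39.2 million (2021)."
```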
The second version of DataGemma implements the more widely used RAG, or retrieval-augmented generation, method. When a user enters a query, the model retrieves information relevant to the prompt from Data Commons. It then sends the collected information to Google’s more advanced, proprietary Gemini 1.5 Pro model, which generates an answer.
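A minimal sketch of that flow follows, assuming a placeholder retrieval helper and Google's generative AI Python SDK for the Gemini 1.5 Pro call; the retrieval step and prompt format here are assumptions, not DataGemma's published pipeline.

```python
# Illustrative retrieval-augmented generation flow: fetch Data Commons statistics,
# then pass them as context to Gemini 1.5 Pro via the google-generativeai SDK.
import google.generativeai as genai

def retrieve_from_data_commons(question: str) -> list[str]:
    """Placeholder: fetch statistics relevant to the question from Data Commons."""
    return ["Per the US Census Bureau, median household income in 2022 was $74,580."]

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key for the Gemini API
gemini = genai.GenerativeModel("gemini-1.5-pro")

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve_from_data_commons(question))
    prompt = (
        "Using only the statistics below from Data Commons, answer the question.\n\n"
        f"Statistics:\n{context}\n\nQuestion: {question}"
    )
    return gemini.generate_content(prompt).text

print(answer_with_rag("What was the median US household income in 2022?"))
```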
According to MIT Technology Review, the RIG version of DataGemma can correctly retrieve numerical facts from Data Commons about 58% of the time. That’s compared with the 5% to 17% accuracy achieved by the other models that Google evaluated. The RAG version of DataGemma, in turn, generated correct answers to 80% to 94% of the questions it received during the search giant’s tests.
Google plans to improve DataGemma by training it on more information. Additionally, Google will increase the number of questions the LLM series can answer from a few hundred to millions. Further down the line, the company plans to integrate DataGemma’s data retrieval capabilities into its flagship Gemini series of language models.