pwshub.com

Mostly AI aims to overcome the AI training plateau with synthetic text based on proprietary datasets

Synthetic data startup Mostly AI Solutions MP GmbH said today it is looking to solve the challenge of finding enough text-based training data for artificial intelligence models.

Its proposed solution is a new, synthetic text generator that can transform proprietary data into something much more useful for AI developers.

The startup says that finding data to train AI models has become a major headache for developers, as they have largely exhausted the most useful publicly available datasets. Most organizations also have proprietary data of their own that could be used for training, but they are loath to use it because of privacy concerns.

It’s a problem that needs to be solved, according to Mostly AI Chief Executive Tobias Hann, who believes the lack of available training data is having a negative impact on the quality of newer AI models. “AI training is hitting a plateau as models exhaust public data sources and yield diminishing returns,” he insisted.

Mostly AI wants to help developers by taking their proprietary data and using it to generate synthetic text, which can then be used to fine-tune AI. In other words, what it’s doing is transforming proprietary text such as emails, customer support transcripts and chatbot conversations into a resource for AI, without compromising privacy.

The company says the shortage of text for AI training has become extremely acute, citing a Gartner Inc. prediction that 75% of companies will be using generative AI to create synthetic customer data by 2026, up from less than 5% of companies today.

But the problem with using proprietary, or “real,” text data is that it often contains sensitive information, such as customers’ personally identifiable information, which means it cannot be exposed to large language models. In addition, these datasets might not be ideal for LLM training because they lack diversity, which results in low-quality outputs.

Synthetic data offers companies an alternative, yet at the same time, it can benefit immensely from being grounded in proprietary data that contains more useful insights relating to the owner’s business.

Putting proprietary data to work in AI

Mostly AI creates a synthetic representation of an organization’s proprietary text data, one that reflects both the text and the structured insights within those datasets. By integrating structured and unstructured information, it enables organizations to create a complete and statistically accurate, yet safe-to-use, version of their proprietary data assets that can then be used to fine-tune AI systems in a compliant way.

Mostly AI also says its synthetic data is of extremely high quality. According to the startup, text from its synthetic text generator significantly outperforms text produced by prompting rival generative AI models such as GPT-4o-mini.

“When training a downstream text classifier, synthetic text generated by the Mostly AI Platform delivers performance improvement as much as 35% compared to text generated by prompting GPT-4o-mini providing either no or just a few real-world examples,” the company said.
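The comparison the company describes, training a downstream classifier on generated text and then scoring it, can be sketched roughly as follows. This is a minimal illustration and not Mostly AI's benchmark: the bag-of-words Naive Bayes classifier, the function names and the four toy training sentences are all hypothetical stand-ins, and it assumes the classifier is scored on held-out real text. The article also does not say whether the 35% figure is a relative or an absolute gain, so the helper at the end assumes a relative one.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Fit a bag-of-words multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of training texts
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the most probable label, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of (text, label) pairs in test_set that the model labels correctly."""
    hits = sum(predict(model, text) == label for text, label in test_set)
    return hits / len(test_set)


def relative_improvement(baseline, candidate):
    """Relative gain of candidate over baseline (assumed reading of 'improvement')."""
    return (candidate - baseline) / baseline


# Hypothetical toy data: stand-ins for generated training text and real test text.
synthetic_train = [
    ("refund my order please", "billing"),
    ("charged twice on my card", "billing"),
    ("app crashes on login", "technical"),
    ("cannot reset my password", "technical"),
]
real_test = [
    ("I was charged twice", "billing"),
    ("login crashes the app", "technical"),
]

model = train_naive_bayes(synthetic_train)
score = accuracy(model, real_test)
```

Running the same steps with training text from two different generators, then passing the two accuracy scores to `relative_improvement`, gives the kind of head-to-head figure the company quotes.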

The startup says that with today’s launch of its synthetic text generator, users will be able to take any model from a platform such as Hugging Face and fine-tune it with synthetic data that’s as rich and accurate as their proprietary text.

“To harness high-quality proprietary data, which offers far greater value and potential than the residual public data currently being used, enterprises must take the leap and leverage both structured and unstructured synthetic data to safely train and deploy forthcoming generative AI solutions,” Hann said.

Source: siliconangle.com
