pwshub.com

Onehouse’s vector embeddings support aims to cut the cost of AI training

Onehouse Inc., a company that sells a data lakehouse based on Apache Hudi as a managed service, today said it has launched a vector embedding generator to automate embedding pipelines as a part of its cloud service.

Vector embeddings are mathematical representations of objects, such as words and images, in a continuous space. Each point in that space is defined by a vector, an ordered list of numbers that represents the features of an object, its coordinates in the space or a complex data type. Embeddings are typically used in machine learning and natural language processing to capture objects’ semantic meaning or other relevant features in a way that a computer can process.
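As a rough illustration of how embeddings capture meaning, the sketch below compares toy vectors with cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors and the words attached to them are invented for this example; real embeddings come from a model and usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings; not produced by any real model.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# Semantically related items should sit closer together in the space.
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

In this made-up data, "cat" and "dog" point in nearly the same direction while "car" does not, which is the property that similarity search over embeddings relies on.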

Vector embedding pipelines continuously deliver data from streams, databases and files on cloud storage to foundation models used in generative AI. Onehouse can now accept embeddings that models return and store them in the data lakehouse.

Cheaper storage

Storing embeddings in the lakehouse can be a big money saver since vector databases typically require powerful hardware and fast storage tightly coupled with compute. Vector databases have been the hottest area of the database management system market since the generative AI craze began last year. Forrester Research Inc. estimates that 75% of traditional databases, including relational and NoSQL models, will incorporate vector capabilities by 2026.

Onehouse is essentially positioning its service as a clearinghouse for vector embeddings. Instead of storing data in a DBMS, customers can take advantage of the low cost of lakehouse storage, which is based on inexpensive, scalable object storage decoupled from computing resources.

“Enterprises need to store a lot of data in their vector databases on local storage, so they need a much bigger vector database instance to get the speed and scalability they need,” said Vinoth Chandar, chief executive of Onehouse and co-creator of Apache Hudi. “Many companies end up running multiple vector databases for different parts of their data so there is no single shared source of truth they can use to manage vector embedding data.”

Hudi has unique capabilities around update management, late-arriving data, concurrency control and other factors needed to scale to the data volumes AI applications need. The company said Onehouse can also support low-latency vector serving for real-time use cases.

The data lakehouse serves vectors in batch, with hot vectors moved dynamically to the vector database for real-time serving. It has scale, cost and performance advantages for applications such as large language models.

Fewer API calls

Chandar said the use of an intermediate lakehouse can also reduce the volume of application programming interface calls to LLMs such as OpenAI LLC’s GPT-4 that are needed to generate vector embeddings.

“Hudi is one of the only lakehouse technologies to support advanced indexing and what we call incremental queries, so it’s able to drastically reduce the number of calls you need to OpenAI,” or another vector embeddings generator, Chandar said. Incremental queries are a Hudi feature that allows users to efficiently query only the data that has changed since the last query or a specific point in time.

“Hudi can give you a single image, so you can have a job running asynchronously in every arc, and it can make one API call for n updates to an upstream data object,” he said.
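The idea Chandar describes can be sketched in plain Python: read only the changes committed since the last checkpoint, collapse repeated updates to the same record into its latest version, and make one embedding call per changed record. This is a simplified model under stated assumptions, not Hudi's or Onehouse's API. The `Change` record, the `CountingEmbedder` stand-in for a paid embedding service and the integer commit times are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    key: str          # record key of the upstream data object
    commit_time: int  # hypothetical, monotonically increasing commit marker
    text: str         # content of the record as of that commit


class CountingEmbedder:
    """Stand-in for a paid embedding API; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def embed(self, text):
        self.calls += 1
        # Toy "embedding"; a real service would return a model-generated vector.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]


def incremental_changes(log, since):
    """Return only the changes committed after the `since` checkpoint."""
    return [c for c in log if c.commit_time > since]


def latest_image_per_key(changes):
    """Collapse n updates to the same key into its most recent version."""
    latest = {}
    for change in sorted(changes, key=lambda c: c.commit_time):
        latest[change.key] = change
    return latest


def refresh_embeddings(log, store, embedder, checkpoint):
    """Embed each changed record once and return the new checkpoint."""
    changed = incremental_changes(log, checkpoint)
    for key, change in latest_image_per_key(changed).items():
        store[key] = embedder.embed(change.text)
    return max((c.commit_time for c in changed), default=checkpoint)


embedder = CountingEmbedder()
store = {}
log = [
    Change("doc-1", 1, "draft"),
    Change("doc-2", 2, "hello"),
    Change("doc-1", 3, "draft v2"),
    Change("doc-1", 4, "final"),
]

# First run: four changes across two records, so two embedding calls.
checkpoint = refresh_embeddings(log, store, embedder, checkpoint=0)
calls_first_run = embedder.calls

# Second run: one new change since the checkpoint, so one more call.
log.append(Change("doc-2", 5, "hello again"))
checkpoint = refresh_embeddings(log, store, embedder, checkpoint)
calls_second_run = embedder.calls - calls_first_run
```

Without the checkpoint and the per-key collapse, the same five changes would trigger five embedding calls on the first pass and a full re-embed on every later one.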

Low cost and flexibility are among the major features driving the growing popularity of data lakehouses. An MIT Technology Review survey of senior executives, chief architects and data scientists sponsored by Databricks Inc. found that almost three-quarters of organizations have adopted a lakehouse architecture. Of those, 99% said the lakehouse was helping to achieve their data and AI goals.

Source: siliconangle.com
