Onehouse’s vector embeddings support aims to cut the cost of AI training

Onehouse Inc., a company that sells a data lakehouse based on Apache Hudi as a managed service, today said it has launched a vector embedding generator to automate embedding pipelines as a part of its cloud service.

Vector embeddings are mathematical representations of objects, such as words and images, in a continuous space in which each point is defined by a vector, or an ordered list of numbers, that represents features of an object, coordinates in a space or a complex data type. Embeddings are typically used in machine learning and natural language processing to capture objects’ semantic meaning or other relevant features in a way that a computer can process.
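
As a rough illustration of that idea, the sketch below compares hand-written three-number vectors using cosine similarity, a common way to measure how close two embeddings are. The words and values are made up for the example and are not output from any model; real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, hand-written for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "truck": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_truck = cosine_similarity(embeddings["cat"], embeddings["truck"])

# Related concepts sit closer together in the space than unrelated ones.
print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs truck:  {cat_truck:.3f}")
```

In this toy setup the "cat" and "kitten" vectors score much closer to 1 than "cat" and "truck" do, which is the property that similarity search over embeddings relies on.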

Vector embedding pipelines continuously deliver data from streams, databases and files on cloud storage to foundation models used in generative AI. Onehouse can now accept embeddings that models return and store them in the data lakehouse.

Cheaper storage

That can be a big money saver since vector databases typically require powerful hardware and fast storage tightly coupled with compute. Vector databases have been the hottest area of the database management system market since the generative AI craze began last year. Forrester Research Inc. estimates that 75% of traditional databases, including relational and NoSQL models, will incorporate vector capabilities by 2026.

Onehouse is essentially positioning its service as a clearinghouse for vector embeddings. Instead of storing data in a DBMS, customers can take advantage of the low cost of lakehouse storage, which is based on inexpensive, scalable object storage decoupled from computing resources.

“Enterprises need to store a lot of data in their vector databases on local storage, so they need a much bigger vector database instance to get the speed and scalability they need,” said Vinoth Chandar, chief executive of Onehouse and co-creator of Apache Hudi. “Many companies end up running multiple vector databases for different parts of their data so there is no single shared source of truth they can use to manage vector embedding data.”

Hudi has unique capabilities around update management, late-arriving data, concurrency control and other factors needed to scale to the data volumes AI applications need. The company said Onehouse can also support low-latency vector serving for real-time use cases.

The data lakehouse serves vectors in batch, with hot vectors moved dynamically to the vector database for real-time serving. It has scale, cost and performance advantages for applications such as large language models.

Fewer API calls

Chandar said the use of an intermediate lakehouse can also reduce the volume of application programming interface calls to LLMs such as OpenAI LLC’s GPT-4 that are needed to generate vector embeddings.

“Hudi is one of the only lakehouse technologies to support advanced indexing and what we call incremental queries, so it’s able to drastically reduce the number of calls you need to OpenAI,” or another vector embeddings generator, Chandar said. Incremental queries are a Hudi feature that allows users to efficiently query only the data that has changed since the last query or a specific point in time.
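
The incremental query idea can be modeled in a few lines of plain Python. This is a conceptual sketch, not Hudi's API: the record layout, the sortable `commit_time` strings and the `incremental_query` function are all assumptions made for illustration.

```python
def incremental_query(records, last_checkpoint):
    """Return records committed after last_checkpoint, plus the new checkpoint.

    Hypothetical helper: models the concept only, not a real Hudi call.
    """
    changed = [r for r in records if r["commit_time"] > last_checkpoint]
    new_checkpoint = max(
        (r["commit_time"] for r in changed), default=last_checkpoint
    )
    return changed, new_checkpoint


# Toy table; commit times are assumed to be sortable strings.
table = [
    {"id": "doc-1", "text": "alpha", "commit_time": "20240101"},
    {"id": "doc-2", "text": "beta", "commit_time": "20240102"},
    {"id": "doc-3", "text": "gamma", "commit_time": "20240103"},
]

# Only rows committed after the last checkpoint need new embeddings.
changed, checkpoint = incremental_query(table, "20240101")
to_embed = [r["id"] for r in changed]
print(to_embed, checkpoint)
```

Running the same query again with the returned checkpoint yields no rows, so an embedding job built this way only pays for data that has actually changed.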

“Hudi can give you a single image, so you can have a job running asynchronously in every arc, and it can make one API call for n updates to an upstream data object,” he said.
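
One way to picture "one API call for n updates" is to collapse a change log to the latest version of each object before embedding it. The sketch below is an assumption about how such a job might look, not Onehouse's implementation; `coalesce_updates`, `embed_batch` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for a paid embedding API.

```python
def coalesce_updates(change_log):
    """Keep only the most recent version of each object.

    Assumes change_log is ordered from oldest to newest.
    """
    latest = {}
    for change in change_log:
        latest[change["id"]] = change
    return list(latest.values())


def embed_batch(objects, embed_fn):
    """Embed each object once and count how many calls were made."""
    vectors = {}
    calls = 0
    for obj in objects:
        vectors[obj["id"]] = embed_fn(obj["text"])
        calls += 1
    return vectors, calls


def fake_embed(text):
    # Placeholder for an external embedding API; returns a dummy vector.
    return [float(len(text))]


# Three updates to doc-1 and one to doc-2 arrive between job runs.
change_log = [
    {"id": "doc-1", "text": "draft"},
    {"id": "doc-1", "text": "draft v2"},
    {"id": "doc-1", "text": "final text"},
    {"id": "doc-2", "text": "hello"},
]

vectors, calls = embed_batch(coalesce_updates(change_log), fake_embed)
print(calls)  # two calls instead of four
```

Here four upstream changes result in two embedding calls, and the vector stored for `doc-1` reflects only its final state.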

Low cost and flexibility are among the major features driving the growing popularity of data lakehouses. An MIT Technology Review survey of senior executives, chief architects and data scientists sponsored by Databricks Inc. found that almost three-quarters of organizations have adopted a lakehouse architecture. Of those, 99% said the lakehouse was helping to achieve their data and AI goals.
