Microsoft advances data management with open formats and AI integration

Five years ago, anyone talking about open data formats or governance might have put their audience to sleep. Today, it’s become the most important conversation going.

It’s clear that data has evolved. That evolution poses certain advantages for customers, according to Dipti Borkar (pictured), vice president and general manager at Microsoft Corp.

Microsoft’s Dipti Borkar talks with theCUBE about open data formats during Supercloud 7.

“These data formats and table formats, on top of the file formats, essentially give our customers a choice,” Borkar said. “It’s opened up, which means that they can have computes that they can choose on top as well. Multiple different computes can run on these formats. That’s the beauty of it. That’s a great value to customers, which means they can do more with their data.”

Borkar spoke with theCUBE Research’s John Furrier and Sanjeev Mohan at the Supercloud 7: Get Ready for the Next Data Platform event, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed the importance of open data formats and the evolving role of data management in the cloud.

Microsoft shifts to open data formats

Microsoft has decided to move from its closed-source formats to purely open formats, particularly with Microsoft Fabric. That was a pretty dramatic change, according to Borkar.

“[We moved] all our engines to reengineer these computes, to now read these native formats,” she said. “We support Delta Lake, and Iceberg is landing very soon. The reason that these are important, again, customers get a choice.”

Companies can run a variety of engines on top and interoperate between platforms. That includes running AI with Databricks or Snowflake, according to Borkar.

“You can interop. We have a layer with OneLake that supports these open formats, which allow customers to interop, so that you’re not locked in, you can do more with your data. You don’t have to move it around,” she said. “You can actually leave it in place, reduce your cost and get value.”
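The idea of a table format sitting on top of plain files, with several engines reading the same data in place, can be sketched in a few lines of Python. This is a toy illustration only, not Delta Lake, Iceberg or OneLake: the `_log` directory, the JSON-lines data files (standing in for Parquet) and every function name below are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def write_rows(table: Path, name: str, rows: list[dict]) -> None:
    """Write one immutable data file (JSON lines stand in for Parquet here)."""
    (table / name).write_text("\n".join(json.dumps(row) for row in rows))


def commit(table: Path, version: int, actions: list[dict]) -> None:
    """Record which files a commit adds or removes in the table's log."""
    log_dir = table / "_log"
    log_dir.mkdir(exist_ok=True)
    (log_dir / f"{version:08d}.json").write_text(json.dumps(actions))


def live_files(table: Path) -> list[str]:
    """Replay the log in version order to find the current data files."""
    live: set[str] = set()
    for entry in sorted((table / "_log").glob("*.json")):
        for action in json.loads(entry.read_text()):
            if action["op"] == "add":
                live.add(action["file"])
            else:  # "remove"
                live.discard(action["file"])
    return sorted(live)


def scan(table: Path):
    """Yield every row in the current snapshot without copying any data."""
    for name in live_files(table):
        for line in (table / name).read_text().splitlines():
            yield json.loads(line)


# Two independent "engines" that share nothing but the files and the log.
def count_engine(table: Path) -> int:
    return sum(1 for _ in scan(table))


def sum_engine(table: Path, column: str) -> float:
    return sum(row[column] for row in scan(table))


def demo() -> tuple[int, float]:
    table = Path(tempfile.mkdtemp())
    write_rows(table, "part-0.jsonl", [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}])
    commit(table, 0, [{"op": "add", "file": "part-0.jsonl"}])
    # A later commit rewrites the data: the new file is added, the old one retired.
    write_rows(table, "part-1.jsonl", [
        {"id": 1, "amount": 10.0}, {"id": 2, "amount": 7.5}, {"id": 3, "amount": 2.5},
    ])
    commit(table, 1, [{"op": "remove", "file": "part-0.jsonl"},
                      {"op": "add", "file": "part-1.jsonl"}])
    return count_engine(table), sum_engine(table, "amount")
```

Both engines resolve the same snapshot from the log, so neither needs its own copy of the data. Real table formats add schemas, statistics and concurrency control on top of this basic shape.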

There are three main open table formats: Delta Lake, Apache Iceberg and Apache Hudi. Each has its own specific way of writing data, and each was built for different use cases, according to Mohan.

“Hudi was built for streaming ingest, and Iceberg … does not support streaming ingest. So when you write in a particular table format, that becomes your primary format,” Mohan said. “The compatibility is only at read-only level.”

The reason, according to Mohan, is that one can’t write a piece of data into Delta and then have it copied into the other formats: the latency would be too high.

“The fine print … is so important,” he said. “Anytime anyone says this is open source, this is compatible, you really have to take it to the next level of detail to understand what is open-source, what is compatible.”

Combining structured, semi-structured, unstructured data efficiently

Today, Microsoft is seeing a combination of structured, semi-structured and unstructured data going into the lake, according to Borkar. The structured data is essentially stored in open table formats.

“Typically, you would build semantic models on top. For example, with Power BI you have a semantic model, and our Copilot then operates on that semantic model and is available for natural language questions,” Borkar said. “Just using that approach, you can essentially use English to come up with a dashboard, right? Instantaneously.”

For semi-structured and unstructured data, that’s where models operating directly on top of the data come in, according to Borkar. For Microsoft, that includes Azure AI Search.

“[That provides] both the vector indexing capabilities directly on this data, but also keyword-based indexing. So, it’s actually a combination, which is very powerful, because in some cases you might need one,” Borkar said. “In some cases, vector indexing is more powerful, and it applies an internal ranking and gives the best results back out. So, AI Search, on top of OneLake, for example, is one of the patterns that we are also starting to see.”
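To make the combination of keyword and vector indexing concrete, here is a minimal Python sketch that ranks a few documents both ways and merges the two lists with reciprocal rank fusion, one widely used merging technique. It does not call Azure AI Search and is not meant to mirror its internals; the documents, the two-dimensional "embeddings", the constant `k=60` and all function names are hypothetical.

```python
import math

# Hypothetical corpus and toy 2-d embeddings, invented for this example.
DOCS = {
    "d1": "delta lake table format",
    "d2": "iceberg table format support",
    "d3": "vector index for generative ai",
}
VECTORS = {"d1": (1.0, 0.0), "d2": (0.8, 0.6), "d3": (0.0, 1.0)}


def keyword_ranking(query: str) -> list[str]:
    """Rank documents by how many query terms they contain; drop non-matches."""
    terms = set(query.lower().split())
    scores = {d: len(terms & set(text.split())) for d, text in DOCS.items()}
    hits = [d for d, s in scores.items() if s > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))


def cosine(a: tuple, b: tuple) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def vector_ranking(query_vec: tuple) -> list[str]:
    """Rank every document by cosine similarity to the query embedding."""
    return sorted(VECTORS, key=lambda d: (-cosine(query_vec, VECTORS[d]), d))


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


def hybrid_search(query: str, query_vec: tuple) -> list[str]:
    return rrf([keyword_ranking(query), vector_ranking(query_vec)])
```

For example, `hybrid_search("iceberg table format", (0.6, 0.8))` places `d2` first because it leads both lists, while `d3` still appears in the results on the strength of the vector ranking alone even though it matches no keywords.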

This is done, essentially, using the ChatGPT versions of Copilot, according to Borkar. All told, it’s a development that has evolved very quickly.

“Now you have a stream of structured data, you’ve thrown in your semi-structured and unstructured data,” Borkar said. “Your vector index is on top of that, and now you’re building generative AI applications.”

Stay tuned for the complete video interview, part of SiliconANGLE’s and theCUBE Research’s coverage of the Supercloud 7: Get Ready for the Next Data Platform event.

Photo: SiliconANGLE

Source: siliconangle.com
