March 19, 2023 9:10 AM
Data lakes will look nothing like this.
In this age of information, big data is increasingly viewed as the lifeblood of any organization. Yet, because data has become so big and varied, properly analyzing it remains a huge challenge for enterprises.
As such, the business insights that this essential data should be able to yield instead become too difficult, time-consuming or costly to produce.
One key challenge is the interaction between storage and analytics solutions and whether they can handle these masses of data. Or is there a way to skip the storage barrier altogether?
Data storage formats: A history
The timeline for this explosion in big data can be broken into three distinct periods.
First there was simple text file (TXT) storage, followed by relational database management systems (RDBMS), allowing for easier monitoring and interaction with larger data sets.
The third stage, modern open-source formats such as Parquet and Iceberg that store compressed files more efficiently, emerged because the data these databases were tasked to collect and analyze outgrew their capacity.
Then came the stage where database companies would develop their own storage methods in the form of data warehouses. These custom-made, proprietary data storage formats offer better performance and allow data-reliant companies to store their data in ways they can query and handle most effectively.
So why is data analytics still lagging?
The cost of data warehouses
Despite the customization they afford, data warehouse storage formats come with a slew of drawbacks.
These warehouses’ ingestion protocols require enterprise data to undergo pre-processing before entering the warehouse, so queries are delayed. There is also no single source of “truth,” as the sync process between the originating storage location (where data, still in its raw format, is created) and the data warehouse is complex and can skew datasets.
Vendor lock-in is another issue: the queryable data in a proprietary storage format is often open to only one application, and thus not always compatible with the various tools required for data analytics. Lastly, anytime a department wants to analyze its data, the data sources need to be duplicated, which can make data sharing between different data warehouses convoluted and sometimes impossible.
As these shortcomings become increasingly prominent and pose greater challenges for data-driven enterprises, the fourth chapter of the data storage saga is unfolding.
Enter the “data lake.”
Diving into the data lake
Unlike a data warehouse (and the walled-in, finite nature that its name implies), a data lake is fluid, deep and wide open. For the first time, enterprises of any size can save relevant data from images to videos to text in a centralized, scalable, widely accessible storage location.
Because these solutions, with their inlets and tributaries and fluid storage formats, are designed for data sharing and syncing as well as storage, data lakes aren’t bogged down by vendor lock-in, data duplication or single-source-of-truth complications.
Combined with open-source formats such as Apache Parquet, which are effective enough to serve the analytic needs of various silos within an organization, these storage systems have enabled enterprises to work within a data lake architecture and enjoy its performance advantages.
The house on the lake
Although data lakes are a promising storage and analytics solution, they are still relatively new. Accordingly, industry experts are still exploring the opportunities and pitfalls that cloud compute capabilities may hold for their storage solutions.
One attempt to overcome the current disadvantages combines data lake capabilities with data warehouse organization and cloud computing. Dubbed the “data lakehouse,” it is essentially a data warehouse floating atop a data lake.
Consider that a data lake is just a collection of files in folders: simple and easy to use, but unable to pull data effectively without a centralized database. Even once data warehouses had developed a way to read open-source file formats, the challenges of ingestion delays, vendor lock-in and the lack of a single source of truth remained.
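To make the "files in folders" point concrete, here is a minimal Python sketch of a toy lake. Everything in it is an assumption made for illustration: the folder layout, the CSV format (a stand-in for real lake formats such as Parquet), and the helper names `write_part`, `total_for_region` and `demo_full_scan`. It shows only that, with no central index, answering a question means opening every file under a folder.

```python
import csv
import tempfile
from pathlib import Path


def write_part(root: Path, folder: str, rows: list[dict]) -> None:
    """Write one CSV file into a folder of the (hypothetical) toy lake."""
    target = root / folder
    target.mkdir(parents=True, exist_ok=True)
    with open(target / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["region", "amount"])
        writer.writeheader()
        writer.writerows(rows)


def total_for_region(root: Path, region: str) -> tuple[float, int]:
    """Sum `amount` for one region by opening every CSV file under root.

    Returns the total and the number of files that had to be read.
    """
    total, files_read = 0.0, 0
    for path in sorted(root.rglob("*.csv")):
        files_read += 1  # no index exists, so nothing can be skipped
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["region"] == region:
                    total += float(row["amount"])
    return total, files_read


def demo_full_scan() -> tuple[float, int]:
    """Build a throwaway lake and total EMEA sales across all years."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_part(root, "sales/2022", [{"region": "emea", "amount": 10},
                                        {"region": "apac", "amount": 5}])
        write_part(root, "sales/2023", [{"region": "emea", "amount": 7}])
        write_part(root, "returns/2023", [{"region": "emea", "amount": 2}])
        return total_for_region(root / "sales", "emea")


if __name__ == "__main__":
    print(demo_full_scan())
```

In this toy setup the cost of a query grows with the number of files, not with the amount of relevant data, which is one way to read the "unable to pull data effectively" complaint.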
Data lakehouses, on the other hand, allow enterprises to use a database-like processing engine and semantic layer to query all their data as is, without excessive transformations or copies, while maintaining the advantages of both methods.
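One way to picture querying files "as is" is a thin metadata layer that records where each file lives and what it contains, so a query can skip irrelevant files without moving or copying any data. The Python sketch below is a toy illustration under stated assumptions, not how any particular lakehouse engine works: the `Catalog` structure, the year-based partitioning, the CSV files and the function names are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical catalog: table name -> list of (partition year, file path).
Catalog = dict[str, list[tuple[str, Path]]]


def register(catalog: Catalog, table: str, year: str, path: Path,
             rows: list[dict]) -> None:
    """Write a CSV file in place and record its location in the catalog."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["region", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    catalog.setdefault(table, []).append((year, path))


def query_total(catalog: Catalog, table: str, year: str,
                region: str) -> tuple[float, int]:
    """Sum `amount` for a table, year and region.

    Files whose partition year does not match are skipped using the
    catalog metadata alone. Returns the total and the files actually read.
    """
    total, files_read = 0.0, 0
    for part_year, path in catalog.get(table, []):
        if part_year != year:
            continue  # pruned: this file is never opened
        files_read += 1
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["region"] == region:
                    total += float(row["amount"])
    return total, files_read


def demo_catalog_query() -> tuple[float, int]:
    """Register two yearly files, then query only the 2023 partition."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        catalog: Catalog = {}
        register(catalog, "sales", "2022", root / "sales/2022/part-0.csv",
                 [{"region": "emea", "amount": 10}])
        register(catalog, "sales", "2023", root / "sales/2023/part-0.csv",
                 [{"region": "emea", "amount": 7},
                  {"region": "apac", "amount": 4}])
        return query_total(catalog, "sales", "2023", "emea")


if __name__ == "__main__":
    print(demo_catalog_query())
```

The files stay where they were written; only the small catalog is consulted to decide which ones to open. Production systems layer far more on top (schemas, statistics, transactions), but the division of labor between metadata and raw files is the part this sketch tries to show.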
The success of this combined approach to data storage and analytics is already encouraging. Ventana Research VP and research director Matt Aslett predicts that by 2024, more than three-quarters of data lake adopters will be investing in data lakehouse technologies to improve the business value of their accumulated data.
Enterprises can now enjoy the analytical advantages of SQL databases as well as the cheap, flexible storage capabilities of a cloud data lake, while still owning their own data and maintaining separate analytical environments for every domain.
How deep does this lake go?
As data companies increasingly adopt cloud data lakehouses, more and more enterprises will be able to focus on one of the most critical capabilities in business today: complex analytics on big datasets. Instead of bringing their data into hosting engines, enterprises will be bringing high-level engines to whatever data they need analyzed.
Thanks to the low entry barriers of cloud data lakehouses, where hardware allocation can be achieved in just a few clicks, organizations will have easily accessible data for every conceivable use case.
Data lakehouse vendors will continue to be tested on their ability to deal with bigger datasets without auto-scaling their compute resources to infinity. But even as the technology progresses, the data lakehouse method will remain consistent in its ability to allow data independence and give users the advantages of both data warehouses and data lakes.
The waters of the data lake may seem untested, but it is increasingly apparent that vendors and enterprises that don’t take the plunge won’t fulfill their data potential.
Matan Libis is VP of product at SQream.