Today, every modern company faces a crucial decision about the shape of its data architecture. Two markedly different approaches are usually mentioned in this context: Data Lake and Data Warehouse. Both have always pursued a common aim: to gain an in-depth understanding of a company’s business activity from its data and, by processing that data, to support managerial decision-making at all levels. Over time, machine decision-making began to supplement managerial decision-making, whether through simple decision models or through more complex statistical and machine-learning models.
As is often the case, each of the two approaches has its advantages and disadvantages, as well as its proponents and opponents. It was therefore only a matter of time before someone applied the classic marketing trick of miraculously combining the benefits of both predecessors in such an attractive way that the shortcomings get overlooked. And to make it clear to everyone, they also combined the names of the predecessors, creating the poetic-sounding name “Lake House.”
A Tour of the Warehouse or a Dip in the Lake?
Let’s recall that the much older concept of a “classic” Data Warehouse offers a structured and organized approach to comprehensive data processing and analysis. It thereby provides a solid foundation on which to build a reliable, consolidated data model. This model defines a single version of the truth, ensuring that a company’s different departments work with identical data and come to similar conclusions. At its core, a Data Warehouse provides a Business Intelligence platform that facilitates a shared understanding and interpretation of the company’s key performance indicators. This approach is the basis for effective management, planning, and strategic decision-making. The downsides of a Data Warehouse are its centralization and cumbersomeness, reflected in the lengthy process of implementing changes (a long time-to-market) and in the cost of operating and further developing it.
In contrast, a Data Lake offers greater overall flexibility and the ability to process unstructured and sometimes partly unknown data, regardless of volume. One of its main benefits is the ability to respond quickly to changing requirements and to make analytical outputs easier to create. Access to data within a Data Lake is more democratic, allowing a broader range of users to take part in analysis and report creation without significant technical and organizational constraints. On the downside, a Data Lake can suffer from a lack of organization, inconsistent and mutually incomparable outputs, and a reduced ability to monitor and control how data is used (whether from the perspective of company efficiency management or information security).
The Lake House Dream
The Lake House concept sets out to combine the best of both worlds: the flexibility and speed of access of a Data Lake and the structure and organization of a Data Warehouse. One of its main tools is the consistent use of metadata at every stage of work. This is truly commendable. However, Lake House focuses only on certain aspects of working with data, specifically those of a primarily technical nature. In other words, it focuses on the aspects that individual users of data systems typically perceive: how quickly can I find out what data I can access, and how fast can I access it?
Indeed, there are many “use cases” that only impact the activities of individual departments or agile squads. However, the most challenging aspect of working with data is coordinating activities at the company-wide level and achieving a mutual understanding of company data and its shared interpretation.
Adopting the Lake House concept can easily lead to the dissipation of data truth. Data truth, meanwhile, is crucial for achieving agreement among the different parts of a company and ensuring everyone is working with the same information. The proverbial “single version of the truth” can only function on a consolidated data model with historically stable (consistent, comparable) data. And such a model can only be established and maintained in a solution based on Data Warehouse principles.
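To make the risk concrete, here is a minimal Python sketch. It is not taken from any particular platform: the order records, the two departmental functions, and the shared `consolidated_revenue` definition are all hypothetical and exist only for illustration. It shows how two teams working directly on the same raw data can each report a perfectly defensible "revenue" figure that disagrees with the other, and how a single agreed definition removes the disagreement.

```python
from decimal import Decimal

# Hypothetical raw order records, as they might land in a lake untouched.
RAW_ORDERS = [
    {"id": 1, "amount": Decimal("100.00"), "status": "paid"},
    {"id": 2, "amount": Decimal("250.00"), "status": "refunded"},
    {"id": 3, "amount": Decimal("75.50"), "status": "paid"},
    {"id": 4, "amount": Decimal("40.00"), "status": "pending"},
]


def sales_revenue(orders):
    """Hypothetical ad hoc definition used by Sales: every order counts."""
    return sum((o["amount"] for o in orders), Decimal("0"))


def finance_revenue(orders):
    """Hypothetical ad hoc definition used by Finance: only paid orders count."""
    return sum(
        (o["amount"] for o in orders if o["status"] == "paid"),
        Decimal("0"),
    )


# Consolidated-model approach (assumed for illustration): the business rule
# is defined once, and every department reads the figure it produces.
REVENUE_STATUSES = frozenset({"paid"})


def consolidated_revenue(orders):
    """Single shared definition of revenue for the whole company."""
    return sum(
        (o["amount"] for o in orders if o["status"] in REVENUE_STATUSES),
        Decimal("0"),
    )


if __name__ == "__main__":
    print("Sales reports:       ", sales_revenue(RAW_ORDERS))
    print("Finance reports:     ", finance_revenue(RAW_ORDERS))
    print("Consolidated figure: ", consolidated_revenue(RAW_ORDERS))
```

Neither departmental function is wrong in isolation. The problem only appears when the two numbers meet in the same boardroom, which is exactly the company-wide coordination that a consolidated model is meant to provide.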
The Truth
In the end, while the Lake House concept is innovative and offers certain advantages, we must not overlook the key principle of the Data Warehouse – one version of data truth for the entire company. Achieving an optimal data architecture that supports flexibility and speed while ensuring consistency and reliability of shared data across the entire company might be possible through a synthesis of both approaches.
However, if we unquestioningly accept the claim that Lake House is self-sufficient, we might, instead of living our dream in a house by the lake, end up on a houseboat. Not that that wouldn’t be an alluring lifestyle too, but it is not built on the firm ground most modern companies need to succeed.
Author: Petr Hajek