This post isn’t a textbook on data vault modelling. It’s a reflection on what data vault methodology is good for and why it’s worthwhile to take another look at it today.
At the very end of the last century, two strong opinions emerged on the structure of data warehouses. Bill Inmon defined the basic principles of data warehouses (subject-oriented, time-variant, non-volatile, and integrated) and promoted the relational structure of data. Ralph Kimball postulated the principles of dimensional modelling as a universal pattern for data warehouses. And so, the Inmon versus Kimball data warehousing dispute erupted, which still smoulders sometimes, even now.
In 2000, Dan Linstedt developed his data vault methodology for modelling data warehouses, especially data warehouse cores. But this methodology has long been out of the mainstream. Maybe because it was originally heavily associated with Microsoft technology, which many data warehouse engineers, at the time, turned their noses up at. Maybe it was defined too strictly. Maybe it just had bad marketing. Now, it doesn’t matter anymore.
In contrast, today it is clear that dimensional modelling via the star schema is no panacea, but it has a lot of amazing features.
- Dimensional modelling has relatively few technical rules that have to be followed, so even business users can easily understand it. Simply stated, it divides information into facts and dimensions. It stores facts (especially numerical values) in special fact tables at the granularity of the linked dimensions.
- The star schema exactly corresponds to the analysis of the business process and is very well understood by business users.
- Storing data in a dimensional model ensures excellent performance for predefined queries—queries such as the analysis of metrics of a given process according to predefined dimensions.
- Dimensional modelling, and especially the star schema, has a lot of technological support. Advanced relational databases have special optimization for star-like queries. All cubes in OLAP, MOLAP, ROLAP, HOLAP, and similar technologies are instances of specific star schemas.
But we must not forget the weaknesses of this approach.
- The transformation of relational data into a dimensional model is computationally intensive.
- Queries that do not fit the structure of the dimensional model, and that is most queries from data analysts or data scientists, perform poorly.
- Some model changes, such as adding a dimension to a star schema, are not feasible due to the granularity of the data in the dimensional table. A new star schema must be created and filled with data.
The result is a situation where dimensional modelling holds an irreplaceable place in data marts and for pre-prepared analyses, but it did not work as a universal pattern.
The principles of data vault modelling are also very simple. The foundation is the hub table, which contains a list of unique business keys. Nothing else. The second principle is the storage of all attributes of business entities in satellite tables. And there is a lot of freedom with this. Different satellite tables can be historized differently, the same attributes can be in more satellite tables, and so on. And the third principle addresses the links between business entities. They are all handled the same way, as a link table, whether it be link type 1:1, 1:n, n:m, or multi-way relationships.
The genius of these simple principles is surfacing today, now that data warehouses are facing growth in the amount of data, the variability of input systems and data formats, and the number of data sources. Data vault modelling is the ideal approach to the agile management of data warehouses.
- The principle of hub tables forces analysts and data warehouse engineers to define exactly which entities are involved and how they are identified. This is the basis for order in the data warehouse, the ability to find out what we actually have in the data warehouse, and especially the ability to correctly assign new data sources to existing ones.
- Adding a satellite table allows you to easily and consistently add new data sources or new attributes to existing entities without changing the existing data or data models at all.
- The same principles apply to the storage of transformed (derived) data. It may be advantageous to store both entities with a structure corresponding to the primary systems and domain-oriented transformed entities corresponding to the logical data model of the organization in the core.
It is also important to keep in mind the weakness of this approach.
- Queries for data stored in this model are complicated and inefficient.
Dimensional modelling and data vault modelling have some of the same features. In both cases, there is a relatively simple set of technical rules for the model in which the data is stored. And in both cases, the results depend on the business analysts and their ability to utilize the given principles. For dimensional modelling, the choice of the analysed processes, their metrics, and the required dimensions for analysis are key. For data vault modelling, the definition of identifiers of stored entities and the definition of links between entities are critical.
So, today we have two well-defined approaches to data storage. Data vault modelling for the data warehouse core completely supports the principles: subject-oriented, time-variant, non-volatile, integrated, and dimensional modelling for data marts.
Data vault is not just modelling, but a comprehensive methodology for building data warehouses. Like Inmon and Kimball, Linstedt distinguishes between the stage/core (business vault) data layers and the information delivery layer, where there are data marts with a dimensional model, base tables for reporting, and possibly other structures depending on the user requirements. However, it also defines other areas that are necessary for the data warehouse to function and are outside this basic division.
They are all vaults:
- Business rule vault—area containing the business rules for the data in the data warehouse. Data quality rules for data in all layers are maintained here: rules for integration, definitions for reconciliation reports, and definitions for data tests.
- Error vault—area where data quality and data test results, including erroneous data, are stored.
- Metrics vault—area containing all operational metadata on data warehouse processing and operation.
- Operational vault—area of the data warehouse used for solving master data management and reference data problems. It can be data from these systems or interfaces of specialized external systems.
The biggest benefits here are that these areas are named, their responsibilities are defined, and it is clearly stated that they are an integral part of the data warehouse. Not something sitting around somewhere that may be finished someday.
Data vault modelling is a simple and inexpensive approach to data warehouse expansion that supports agile development and can maintain order in the data warehouse model. The methodology allows you to design and manage data warehouses that are fully auditable and traceable. It also allows you to design and manage warehouses that match an exact copy of the input data (sometimes required by regulations) with aggregated and cleaned data. At present, this methodology is already supported by plenty of data warehouse development tools.
The bottom line is that data vault modelling is a robust way to model the core of a modern data warehouse to weather the storm.
Author: Ondřej Zýka
Information Management Professional