It has been more than 20 years since building data warehouses first became a frequently discussed topic at most larger companies. That was logical: they had realized that their businesses and the whole market were becoming ever more complex and ever more connected to data. The connection came about in two ways – they produced lots of data themselves, and they also needed lots of external data to better understand their businesses (and even to predict their future development, because in those days people still believed they could predict the future :).
So they had lots of data and many questions that this data could answer. Let’s look back at the main reasons why they decided to build data warehouses. First, they had to integrate data from different sources, both logically and technically. Second, they needed to keep historical (unchanged) records of the data collected (it is very likely they had read Orwell and believed that anyone who controlled the past also controlled the future :). And third, they had to agree on a single interpretation of this data, what came to be called the single version of the truth. In short, building data warehouses turned out to be the right thing to do. And understanding business through data was called Business Intelligence.
In recent years, we have gradually moved on to the next level of the game. The data around us has grown to such complexity that we first need to understand the data itself before we can use it to understand what it is saying about anything else. I suggest we call this discipline Data Intelligence.
Metadata is defined as data about data: it describes our data, answers our questions about it, and helps us understand it better. The more data we have, the more we need metadata. And we really do have lots of data, not only in primary systems but also in the previously mentioned data warehouses and even in scary objects called data lakes… And the requirements placed on our data keep growing. Just one of many examples: we are supposed to categorize all elements of our data and decide which of them are personal data, because personal data must be protected more rigorously than the rest…
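To make that categorization requirement concrete, here is a minimal sketch (in Python, with entirely hypothetical names) of what a metadata record for a single data element might look like, including a flag marking it as personal data:

```python
from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PERSONAL = "personal"  # must be protected more rigorously (e.g., under GDPR)

@dataclass
class ColumnMetadata:
    """Describes one element of our data: where it lives and what it means."""
    system: str        # primary system, data warehouse, or data lake
    table: str
    column: str
    data_type: str
    description: str   # the agreed business meaning
    sensitivity: Sensitivity

# Example: classifying a column as personal data
email = ColumnMetadata(
    system="crm", table="customer", column="email",
    data_type="varchar(255)",
    description="Customer's primary e-mail address",
    sensitivity=Sensitivity.PERSONAL,
)
```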
We have many categories of metadata – technical metadata, business metadata, operational metadata, etc. But we do not only have metadata that describes existing data; we can also have metadata produced in the design phases of the system development cycle – some people call it “prescriptive”, as it prescribes what the new data should look like. We also have different sources of metadata, often describing the very same data. Anybody who has participated in even a small data migration or integration project certainly knows what I am writing about. It is so often the case that the logical model documentation is quite far from what was actually implemented as the physical model, or that the implementation has drifted over time. So if you run any reverse-engineering process, you will almost certainly get a very different metadata description than what was in the original documentation… And moreover, everything changes – not only the data but also the metadata – so the time dimension is very important here as well.
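To illustrate that drift, here is a hedged sketch (the table, columns, and types are hypothetical) that compares the documented logical model with what a reverse-engineering run actually finds in the database:

```python
# Documented logical model vs. what reverse engineering finds (hypothetical data).
documented = {
    "customer_id": "integer",
    "full_name": "varchar(100)",
    "email": "varchar(255)",
}
reverse_engineered = {
    "customer_id": "bigint",   # widened at some point, never documented
    "full_name": "varchar(200)",
    "email": "varchar(255)",
    "phone": "varchar(50)",    # added later, missing from the documentation
}

def diff_metadata(documented: dict, actual: dict) -> list:
    """List the differences between two metadata descriptions of the same table."""
    issues = []
    for column, declared_type in documented.items():
        if column not in actual:
            issues.append(f"{column}: documented but missing in the database")
        elif actual[column] != declared_type:
            issues.append(f"{column}: documented as {declared_type}, found as {actual[column]}")
    for column in sorted(actual.keys() - documented.keys()):
        issues.append(f"{column}: exists in the database but was never documented")
    return issues

for issue in diff_metadata(documented, reverse_engineered):
    print(issue)
```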
All the classic DWH use cases apply to metadata as well; the situation closely mirrors the one we faced with data and data warehouses:
We need to integrate the metadata both technically (ideally into the same technology platform) and logically (into a unified metadata model); we need to keep historical versions of the metadata; and we need to agree on a single interpretation of the metadata. Just as Business Intelligence is supported by data warehouses, Data Intelligence needs to be supported by “metadata warehouses”.
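As a rough sketch of what these three requirements might look like in one place (the names and structure are my assumptions, not an established design), here is a tiny “metadata warehouse” that never overwrites a description, keeps its full history, and can answer what we believed about an element at any point in time:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class MetadataVersion:
    """One historical version of a data element's description."""
    description: str
    source: str                          # e.g., "logical model", "reverse engineering"
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version

@dataclass
class MetadataWarehouse:
    """A minimal metadata warehouse: integrate, keep history, never overwrite."""
    history: dict = field(default_factory=dict)

    def register(self, element: str, description: str, source: str) -> None:
        """Record a new version and close the validity of the previous one."""
        now = datetime.now()
        versions = self.history.setdefault(element, [])
        if versions:
            versions[-1].valid_to = now
        versions.append(MetadataVersion(description, source, valid_from=now))

    def as_of(self, element: str, when: datetime) -> Optional[MetadataVersion]:
        """What did we believe about this element at that moment?"""
        for version in self.history.get(element, []):
            if version.valid_from <= when and (version.valid_to is None or when < version.valid_to):
                return version
        return None

# Example: two sources describing the same element over time
mw = MetadataWarehouse()
mw.register("crm.customer.email", "Customer e-mail", source="logical model")
mw.register("crm.customer.email", "Primary contact e-mail (verified)", source="reverse engineering")
```

The design choice mirrors how data warehouses track history: each change closes the validity interval of the previous version instead of replacing it, so the time dimension of the metadata is preserved.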
We already know that in turbulent environments (and today’s world is indisputably turbulent) it is not possible to predict the future. But it is possible to be prepared for whatever the future brings. Therefore, we need to maintain our metadata and, ideally, keep it well organized in systems similar to data warehouses.