Data Warehouse Development Is Like an Endless Relay Race

Some time ago we organized a workshop for one of our customers as an introduction to a less popular field called BI governance. It can be loosely defined as a system of internal processes for the operation and continuous development of data warehouses, such as change management, configuration management, release management and so on. We talked about the need for atomic architecture that splits the entire data warehouse into the smallest building blocks. We discussed the versioning of these blocks, we analysed ways of deploying them to different production and non-production environments and talked about testing, locking parts of the system with ongoing changes, etc. The customer’s experts nodded, agreed and even disagreed, asked questions and planned their next steps. We agreed that their main goal should be to convince a wider audience in their organization that BI governance was really needed, so they were going to explain to them again, perhaps in a more accessible way, what we were talking about.

Analogies are always inaccurate and misleading. Nevertheless, I cannot resist making one: the operation and development of a data warehouse (or virtually any robust software system) can be thought of as a very special kind of endurance event, something like a relay race. Since good coffee is the only kind of doping allowed, feel free to indulge in a cup before you continue reading.

The rules are quite simple. Our runners are the basic (let’s say atomic) elements of the software system (to keep it simple, imagine individual SQL scripts for the data warehouse). Each element is assigned something like a starting number. If you change an element or add a new one, it gets a new starting number. Whoever does not have a starting number does not race and is not part of the system; they may be a spectator along the track or an organizer (and you really need to be careful that spectators and organizers do not mix with the runners). Changes to the starting list can only be made via a formal application, so that each (properly marked and numbered) application can be associated with specific changes in the starting numbers. This is administratively demanding and unpopular with all the racing teams, but in return, we can let them make changes very often (i.e. not only before a new season or new stages, but in every round, in case of accidents or technical problems, and whenever anything exceptional occurs). That is all; the race can begin.
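As a minimal sketch of these rules (all names here, such as `StartingList` and the `CR-…` identifiers, are hypothetical illustrations, not part of any real tool), the "starting number" is simply a version that can only ever change through a formal change request:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artefact:
    """An atomic element of the system, e.g. one SQL script."""
    name: str
    version: int  # the "starting number"; bumped on every change

class StartingList:
    """Registry of artefacts; every change goes through a change request."""
    def __init__(self):
        self._versions = {}   # artefact name -> current version
        self._changes = []    # (change_request_id, artefact name, new version)

    def apply_change(self, change_request_id: str, name: str) -> Artefact:
        """Register a new or modified artefact under a formal change request.

        A new element starts at version 1; a changed element gets the
        next number, and the change request is linked to that number.
        """
        new_version = self._versions.get(name, 0) + 1
        self._versions[name] = new_version
        self._changes.append((change_request_id, name, new_version))
        return Artefact(name, new_version)

registry = StartingList()
registry.apply_change("CR-001", "load_customers.sql")       # joins the race as v1
a = registry.apply_change("CR-002", "load_customers.sql")   # changed -> v2
print(a.version)  # → 2
```

The point of the sketch is the linkage: given any starting number, you can trace back which application (change request) produced it, and vice versa.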

In reality, it is even more complicated, because the race is also run on several test circuits, where racers cloned from the main field (which, for some reason, we call production) are mixed with their improved clones. These improved clones and new reinforcements must first pass through all the test circuits before they become part of the main, real race.

Another detail is that it is not really a race at all, because it is not about victory—it is about keeping the system together and functioning as a whole, to our general satisfaction. As with the Tour de France, Formula 1, the Dakar Rally, Ski Classics and other similar events, with data warehouses it is absolutely crucial that the show goes on—sponsors have already spent incredible amounts of money (and intend to pump in more), viewership is high, there are many partners, customers and suppliers connected to the company in some way, and really no one wants to hear someone say “it is not working again” every morning.

So how do you organize (and manage) the whole thing? As for those famous races, I do not know. In the case of data warehouses, I can guess. First of all, it is architecture—it must enable the system to be decomposed into atomic units (our runners), and the more elementary they are, the better. However, architecture cannot deal with every atom. It only defines the basic building blocks, and from them the categories emerge to which the atoms will belong. It is better to have fewer of these building blocks. Typical data warehouse building blocks are data objects (i.e. mostly database tables or data files) and their data elements (i.e. usually columns in tables), then data transformations (i.e. SQL scripts or their precisely defined fragments) and orchestration rules (i.e. instructions defining what must be processed earlier or later and what can run simultaneously). That is all. We do not need any other types of building blocks, even though the individual atoms, or more precisely artefacts, can number in the tens or hundreds of thousands in the target solution. (If a type of building block seems to be missing, say data quality control, we try to organize it under what we already have—perhaps under data transformations.)
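The taxonomy above is small enough to write down in a few lines. A possible sketch (the type and check names are my own illustrations, not a prescribed standard):

```python
from dataclasses import dataclass
from enum import Enum

class BlockType(Enum):
    """The four building-block categories described in the text."""
    DATA_OBJECT = "data object"              # database tables, data files
    DATA_ELEMENT = "data element"            # columns in tables
    TRANSFORMATION = "transformation"        # SQL scripts or their fragments
    ORCHESTRATION_RULE = "orchestration"     # ordering / parallelism rules

@dataclass(frozen=True)
class Atom:
    """One artefact, always filed under exactly one building-block category."""
    name: str
    block_type: BlockType

# A data-quality check has no category of its own, so we organize it
# under what we already have: data transformations.
dq_check = Atom("check_customer_ids_not_null", BlockType.TRANSFORMATION)
print(dq_check.block_type.value)  # → transformation
```

Keeping the number of categories this small is the whole trick: tens of thousands of artefacts remain manageable precisely because every one of them falls into one of four well-understood boxes.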

But we have omitted one building block, and that is metadata. Metadata is closely tied to architecture, which defines the necessary metadata model; at the same time, the specific metadata content is the set of artefacts that make up the system. Metadata, whether descriptive (defining the system) or process and operational (in our analogy, carrying information about the course of the race, its intermediate and final results, etc.), must support the basic rules for system development: recording artefacts, assigning starting numbers, recording starting lists for all environments (production, test, etc.), recording all requested changes and linking those changes to changes in the starting numbers (versions) of the artefacts.
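These metadata requirements can be sketched very compactly. In the toy model below (the environment names, artefact names and `CR-…` identifiers are all hypothetical), a "starting list" records which artefact versions run in each environment, and promotion between environments is driven by change requests:

```python
# Starting lists: which artefact version is "racing" in each environment.
starting_lists = {
    "production": {"load_customers.sql": 3, "dim_customer": 2},
    "test":       {"load_customers.sql": 4, "dim_customer": 2},
}

# Change requests, each linked to the artefact versions it produced.
change_requests = {
    "CR-017": [("load_customers.sql", 4)],
}

def promote(cr_id: str, source: str, target: str) -> None:
    """Deploy the artefact versions of one change request from a source
    environment (a test circuit) to the target (e.g. production).

    An artefact version may only reach the target if it has already
    raced on the source circuit.
    """
    for name, version in change_requests[cr_id]:
        if starting_lists[source].get(name) != version:
            raise ValueError(f"{name} v{version} never ran in {source}")
        starting_lists[target][name] = version

promote("CR-017", "test", "production")
print(starting_lists["production"]["load_customers.sql"])  # → 4
```

Even this toy version shows why Excel runs out of breath quickly: the same bookkeeping must hold for every environment, every artefact and every change request, consistently and at all times.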

If we have architecture and metadata, everything else is just mechanics, and it depends on the chosen tools. We can perhaps do without a special tool for managing the architecture itself (text and image documentation may be sufficient), but we certainly need a tool for data (and metadata) modelling and for managing content, especially metadata (Excel might suffice only at the beginning and for simple solutions). I am leaving aside the development environment itself (the development of the artefacts and their templates). Another important factor is the choice of tools for managing the artefact versions themselves, preparing deployment packages, and recording change requests and linking them to artefact and metadata versions.

So, this has been my attempt to explain the somewhat unusual basics of BI governance. I will repeat that all analogies are inaccurate and simplistic. I would really appreciate critical comments from actual software engineering experts, as well as responses from business readers. It is absolutely crucial that the representatives of both camps, however polarized they sometimes are, understand each other as much as possible.


Author: Petr Hájek

Information Management Advisor