I have already commented on the role the core layer plays in traditional data warehouse solutions. Even if it is not the youngest of concepts, it still has significant benefits to offer. Today, I would like to discuss the options that open up when you approach the core layer as a state of mind rather than as a database layer.
Let’s start with the purpose of the core layer – it transforms data from the source systems’ data models into more universal structures, integrates data from different sources into the same data model, provides a single source of truth, and efficiently delivers data to subsequent solutions. Of course, many more functions could be added to the list, such as data quality checks, but I believe we can do just fine with the main functions mentioned above.
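To make these functions a bit more concrete, here is a minimal sketch of what the core layer does conceptually: two hypothetical source systems (a CRM and a billing system, both invented for this illustration) deliver customers in their own models, and the core maps both into one universal structure that downstream consumers treat as the single source of truth. All field names here are assumptions, not a real schema.

```python
# Hypothetical sketch: core-layer transformation and integration.
# Two source feeds with different models are mapped into one target model.

def from_crm(record):
    # The CRM delivers "first"/"last" separately and uses its own id scheme.
    return {
        "customer_key": f"CRM-{record['crm_id']}",
        "full_name": f"{record['first']} {record['last']}",
        "country": record["country_code"].upper(),
    }

def from_billing(record):
    # Billing delivers a single "name" field and lowercase country codes.
    return {
        "customer_key": f"BIL-{record['acct_no']}",
        "full_name": record["name"],
        "country": record["country"].upper(),
    }

def build_core(crm_rows, billing_rows):
    """Integrate both feeds into the same target model."""
    core = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
    # Every row now shares one schema, regardless of its origin.
    return core

crm = [{"crm_id": 7, "first": "Ada", "last": "Lovelace", "country_code": "gb"}]
billing = [{"acct_no": 42, "name": "Alan Turing", "country": "gb"}]
print(build_core(crm, billing))
```

Whether this mapping lives in a physical database layer or only as metadata over views is exactly the design question discussed below; the logic itself has to exist either way.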
I am convinced that at least the main reasons behind the creation of a core layer are still valid – when there is more than one source system, there is still the need to integrate and combine. Users must be assured that the data displayed across different reports originates from the same consistent source. But is the creation of a core layer the only answer to this need?
A core layer database is certainly not the only answer to this problem, even if it does serve the main purposes described above well. Some approaches combine data democratization with responsibility over a given domain or data area (e.g. data mesh). Other theories propose a more technical approach that focuses on the efficiency of the transformation (e.g. data vault) and pushes most business-related transformations into the upper levels of the data warehouse solution, closer to the users (or the lower levels, depending on whether you see the glass half empty or half full). There are also approaches offered as part of a toolset for building a data solution (e.g. Databricks, Snowflake), where the core layer moves into the metadata layer and exists less as a dedicated database.
These solutions have their pros and cons – data mesh offers a perfect solution for a functioning agile organization, where product owners create data sets and the best one prevails on the internal data market. On the other hand, it requires mature users and a specific organizational setup, where responsibility is clearly defined and the strategy is well understood and well aligned with other company processes. A technical approach like data vault allows increased, metadata-driven automation (at least at the raw data vault level), but the business-level transformation must still be done somewhere (information mart and business data vault). It might be easier, as the basic transformations on the lower levels are automated and metadata driven, but it also might not be. Some tools provide flagging and tagging utilities to label the parts of the code that, from an architecture perspective, belong to the core layer. Nevertheless, even if you do not need a big core in such a case, you need a reliable metadata solution and solid processes to ensure the reliability of these labels.
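The tagging idea above can be sketched in a few lines. In this hypothetical example, each transformation step is labeled with the architectural layer it belongs to, and a simple audit guards the reliability of those labels by failing fast on missing or unknown tags. The layer names and transformation steps are invented for illustration; real tools implement this with far richer metadata catalogs.

```python
# Hypothetical sketch of "core as metadata": transformations carry layer
# tags, and a check enforces that every tag is known and accounted for.

REGISTRY = {}
KNOWN_LAYERS = {"staging", "core", "mart"}

def layer(tag):
    """Decorator that records which layer a transformation belongs to."""
    def wrap(fn):
        REGISTRY[fn.__name__] = tag
        return fn
    return wrap

@layer("core")
def unify_customer(row):
    # A core-level step: map a raw record into the unified model.
    return {"customer_key": row["id"], "full_name": row["name"].title()}

@layer("mart")
def customers_per_country(rows):
    # A mart-level step: aggregate for reporting.
    counts = {}
    for r in rows:
        counts[r["country"]] = counts.get(r["country"], 0) + 1
    return counts

def audit_labels():
    """The 'solid process' part: fail fast if a label is unknown."""
    bad = {name: tag for name, tag in REGISTRY.items() if tag not in KNOWN_LAYERS}
    if bad:
        raise ValueError(f"unknown layer tags: {bad}")
    return sorted(name for name, tag in REGISTRY.items() if tag == "core")

print(audit_labels())  # → ['unify_customer']
```

The point is that the "virtual core" is only as trustworthy as this registry and the process that keeps it honest.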
In other words – if you need to dedicate an effort of 100 to transform the raw data into the desired outcome, it is naive to believe that any approach, methodology or tool will reduce this required effort to 10. I believe it might be feasible to reduce the effort by 20% when the use case ideally matches the selected approach (or tool, or methodology). So in the end, even if your core layer is purely virtual, it still requires effort to keep it real and working. However, some owners prefer more tangible proof than a core layer state of mind – and so the core layer as a transformation phase / database layer still exists in many solutions and is being built into brand new data solutions as well.
Nevertheless, I have seen some data solutions that applied the core as a state of mind instead of as a database layer, and it has worked well so far. However, I am talking mostly about smaller companies with a limited number of source systems, small teams and very lean processes. I am afraid that beyond a certain scale of the solution/team/company, this approach becomes unsustainable, or the effort to manage the metadata around such a data solution increases dramatically. Another case where such an approach is efficient is prototyping, or building solutions of limited size/complexity, where the core layer is simply an overhead at the given moment.
Despite the efforts to maximize the level of transformation automation, there will always be business-related transformation, which is very specific. And no matter what kind of data solution you have, this transformation must be executed (or at least designed) with minimal automation. Business-related data transformation is also the phase with the biggest added value – in this phase the data is combined with business know-how to give relevant results to the users.
Even though it seems too hard to automate business-level transformations, that does not mean we should abandon the idea of increasing automation, especially given the technologies now available. This kind of automation can (and should) be applied during the analysis phase, mainly to the transformations that can be metadata driven. I have in mind activities like data classification, data profiling, etc. Check out recent Gartner articles to find out more about emerging trends in this domain.
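As a small illustration of the kind of analysis-phase automation meant here, the sketch below derives basic profiling metadata (null rate, distinct count, inferred type) per column. Real profiling tools go much further; this only shows the metadata-driven idea, and the sample data is made up.

```python
# Hypothetical sketch: minimal data profiling over a list of records.

def profile(rows):
    """Profile dict records column by column into simple metadata."""
    columns = {k for row in rows for k in row}
    stats = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "null_rate": round(1 - len(non_null) / len(values), 2),
            "distinct": len(set(non_null)),
            "inferred_type": type(non_null[0]).__name__ if non_null else "unknown",
        }
    return stats

sample = [
    {"id": 1, "email": "a@example.com", "age": 31},
    {"id": 2, "email": None, "age": 27},
    {"id": 3, "email": "c@example.com", "age": None},
]
print(profile(sample))
```

Metadata like this is exactly what can feed automated classification and, eventually, the generation of the more mechanical transformations.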
What is the conclusion? Can a virtual core layer effectively provide the same level of service as a physical database core layer? I believe it can, but only under specific conditions. Apart from nimble enterprises that start from scratch, companies with a strong data culture can also benefit from this virtual approach. Experienced teams and skilled users are more likely to benefit from such a solution than to be dragged down by its complexity. That is why, in my opinion, solutions with robust core databases will continue to exist: for some companies it will take a long time to develop a data governance culture and data literacy to a level that enables efficient utilization of a virtual core layer. Let’s dig a bit deeper into the influence of data governance culture on data solution design next time.
Author: Tomáš Rezek