Data mesh is such a new concept that, as of May 2021, it doesn’t even have its own Wikipedia page. Despite this, or maybe because of it, it’s worth knowing what it’s for.
Data mesh is a data management approach for specific kinds of data repositories. It is especially useful when multiple groups, usually groups of analysts, work in an agile way with large volumes of data. Data mesh isn’t a universal approach; it always coexists with areas of much stricter, more centralized data management in the organization’s environment.
The definition of data mesh is simple. A managed environment must have the following four attributes.
- Domain-based decentralization
- Focus on data products
- Self-service shared support
- Federated management
The first attribute, domain-based decentralization, means that the data from each domain is handled by those who work with that domain the most and understand it best. Data warehouses, by contrast, often have a single organizational unit responsible for all the data in the warehouse, its integration and its quality. That unit is responsible for one version of the truth that is common to all, whether the data is financial, business, marketing or legislative-reporting data. Data mesh presumes that those who understand a domain best are best placed to manage its data and shape it into the most useful form.
The most important attribute of a data mesh environment is the focus on first-class data products. A data product is a set of data created and maintained for a well-defined purpose. It can be a table, a data schema, a star schema or even just a CSV file.
For a dataset to be considered a product, its intended purpose and its author must be clear. It must be evident what it contains and what structure it has. Notably, the emphasis is not so much on what transformations created the data as on what the source data was; it is presumed that the author has prepared the best possible data for the given purpose.
First-class data products must be trustworthy. That’s why it is important to know who the author is. They have to be easy to use, accessible and valuable to users. And therefore, it is important to know why they were created. In a data mesh environment, it is natural that similar information is contained in multiple data products created for different purposes.
The third attribute is a focus on self-service. Users have to be able to find the product most suitable for their needs and use it easily. They must also be able to easily integrate, transform and create new products from existing data products.
The last attribute is federated management. The managed environment is used and modified independently by many parties: individual domain administrators, analysts who use existing data products and create new ones, and end users who only consume existing data products. Nevertheless, everyone must adhere to certain minimum rules that guarantee the sustainability of the environment. The three most important are:
- Keep a central catalog of data products so that all data products can be found.
- Adhere to a uniform way of describing data products: data structure, author, purpose of creation and, especially, how up-to-date the data in the data product is.
- Use unified solution technologies, especially unified self-service support.
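The description rules above can be enforced mechanically. Here is a minimal sketch in Python, assuming a hypothetical convention in which each product directory holds a JSON descriptor; the field names (`name`, `author`, `purpose`, `schema`, `refreshed_at`) are illustrative, not a standard.

```python
import json

# Hypothetical descriptor a product owner might place in the product's
# directory; field names are illustrative, not a fixed standard.
PRODUCT_JSON = """
{
  "name": "monthly_churn",
  "author": "customer-analytics-team",
  "purpose": "Monthly churn rate per customer segment for retention reporting",
  "schema": {"segment": "string", "month": "date", "churn_rate": "double"},
  "refreshed_at": "2021-05-01T06:00:00Z"
}
"""

# The minimum rules: author, purpose, structure and freshness must be stated.
REQUIRED_FIELDS = {"name", "author", "purpose", "schema", "refreshed_at"}

def validate_product(descriptor: dict) -> list:
    """Return the required fields missing from a product descriptor."""
    return sorted(REQUIRED_FIELDS - descriptor.keys())

product = json.loads(PRODUCT_JSON)
print(validate_product(product))  # an empty list means the minimum rules are met
```

A central catalog can then be as simple as a crawler that collects these descriptors and indexes them by name and purpose.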
It isn’t clear exactly where to draw the line between centralized and decentralized components. The aim is to have the platform enforce uniformity as much as possible. For example, the central catalog can be nothing more than a naming convention in the directory structure. Metadata about a data product may be required in the form of a JSON file in the product directory. The product’s data schema can also be stored directly in the data product: formats such as Avro and Parquet contain a description of the data structure and even support schema changes over time. Messaging tools can be used to advantage as data transformation technologies. Creating data products as messaging topics enables easy self-service data retrieval simply by subscribing to a topic, and creating another data product is just a matter of creating another pipeline that processes existing topics and publishes its own.
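The topic-as-product pattern can be sketched with an in-memory stand-in for a broker (a real deployment would use a messaging system such as Kafka); the topic names and the transformation are illustrative.

```python
from collections import defaultdict

class MiniBroker:
    """In-memory stand-in for a messaging system: each data product is a topic."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

broker = MiniBroker()
results = []

# A derived data product is just a pipeline: subscribe to an existing
# topic, transform each record, and publish it to a new topic.
broker.subscribe("orders", lambda order: broker.publish(
    "order_totals", {"id": order["id"], "total": order["qty"] * order["price"]}))

# A consumer self-serves simply by subscribing to the derived topic.
broker.subscribe("order_totals", results.append)

broker.publish("orders", {"id": 1, "qty": 3, "price": 10.0})
print(results)  # [{'id': 1, 'total': 30.0}]
```

The appeal of this design is that adding a data product touches nothing upstream: the new pipeline only subscribes and publishes.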
Compared to data warehousing, which focuses on a uniform truth and excellent data quality that suits everyone, data mesh gives individual groups more freedom to address their specific needs and makes the solution easier to modify. Compared to the data lake concept, which is focused on the technical and process side of data handling, data mesh concentrates more on the user side. That’s why we will encounter this concept more and more in today’s agile world.
Author: Ondřej Zýka
Information Management Principal Consultant