Last month, Česká spořitelna, in cooperation with Profinit and MANTA, hosted a long-awaited hackathon devoted to the efficient extraction of data lineage information. The successful event brought several very interesting insights, not only for participants from the organizing companies but also for students and the professional public.
Using the MANTA tool, we prepared in advance a huge database describing, in very fine detail, all the paths through which data flows from source systems toward end users across robust data solutions such as the data lake, the data warehouse, analytical databases, and final reports.
The assignment was straightforward: take a prepared package of metadata and “retell” the story of the complex, richly branched internal data flows, ideally as a clear and simple picture readable by a layperson.
Participants could choose from three technologies: the Oracle relational database, the Neo4j graph database, and the Databricks cloud platform, which enables data processing with Spark or specialized libraries. To visualize the outputs, they could use Neo4j itself, the open-source PlantUML tool, or the dedicated Cluemaker tool.
The first pleasant discovery was the interest in the Neo4j graph database. Even though most registered participants had named Oracle as their preferred technology, a significant number jumped at the opportunity to try out a graph query language. The start of the hackathon was thus partly reminiscent of a workshop in which the more experienced graph specialists helped members of other teams master the basics of Neo4j. It quickly became clear that a large bank has many other “use cases” well beyond data lineage for which a graph database would be a very suitable tool, and that building competence in this technology would pay off in the long term.
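For a taste of what the teams were practising, a downstream-lineage query might look like the sketch below. It uses the official neo4j Python driver; the `Column` label, the `FLOWS_TO` relationship, and the connection details are hypothetical stand-ins, not MANTA's actual metadata model.

```python
# Minimal sketch of a downstream-lineage query, assuming a hypothetical
# model (:Column)-[:FLOWS_TO]->(:Column). Requires: pip install neo4j
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

DOWNSTREAM = """
MATCH path = (src:Column {name: $name})-[:FLOWS_TO*1..10]->(dst:Column)
RETURN dst.name AS target, length(path) AS hops
ORDER BY hops
"""

with driver.session() as session:
    # Everything fed, directly or transitively, by the given source column
    for record in session.run(DOWNSTREAM, name="CLIENT.SURNAME"):
        print(record["target"], record["hops"])

driver.close()
```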
Another insight was that even making “rough” metadata describing data lineage available can save significant human work in many daily tasks, such as impact analyses or the frequent tracing of errors in reports. Since MANTA provides a clear generic metadata model, a wide range of users, even those with only basic data analysis skills, can quickly find answers to their questions.
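To make the impact-analysis and error-tracing cases concrete: once lineage is exported as a directed graph, both reduce to simple reachability queries. A small sketch with networkx, using invented table and report names:

```python
# Sketch of impact analysis over exported lineage metadata; the node
# names are invented, a real export would come from the MANTA model.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("src.clients", "dwh.party"),
    ("dwh.party", "mart.client_360"),
    ("mart.client_360", "report.ecb_exposures"),
    ("dwh.party", "report.kyc_overview"),
])

# Impact analysis: everything downstream of a table we plan to change.
print(sorted(nx.descendants(lineage, "dwh.party")))

# Error tracing: every upstream suspect for a broken report.
print(sorted(nx.ancestors(lineage, "report.ecb_exposures")))
```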
The preparation of the hackathon itself, during which many security requirements had to be dealt with, also showed that the metadata core contains no sensitive data on data lineage and can therefore be freely distributed to a wide range of analysts. This confirms the general rule that while the data itself (and bank data especially) needs to be carefully protected, metadata should be transparent and widely available.
But the most important thing was confirmation of the expected fact that metadata holds a great wealth of information. At the beginning of the event, for example, we gave a lecture on the most common uses of data lineage, and we were surprised by how many other useful examples the participants themselves found for their everyday work.
More of a humorous curiosity was the hyperbolic attempt to calculate how much the wedding of one of its clients costs the bank – in other words, how many UPDATE operations on the surname column in individual tables, and how many historization records, must be performed to incorporate the change.
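Taken at face value, the estimate is simple arithmetic over counts that lineage metadata can supply. A sketch with invented figures:

```python
# Tongue-in-cheek estimate of the "cost of a wedding": the physical
# writes one surname change triggers. All counts below are invented.
surname_tables = 42   # tables whose surname column the lineage reaches
historized = 17       # of those, tables that version records instead of overwriting

updates = surname_tables - historized  # plain in-place UPDATEs
inserts = historized                   # new record versions (historization)
print(f"One wedding ≈ {updates} UPDATEs + {inserts} historization INSERTs")
```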
Another example was the effort to calculate how much electricity is used to generate a specific report for the European Central Bank. The available metadata sample would obviously not be sufficient for this task, but it opened a discussion on enriching the metadata from other sources, such as the financial costs of running specific data platforms or the performance of specific servers.
And here is the main finding for me personally: we would achieve a rocket-like increase in utility by systematically integrating metadata from different sources and combining technical, operational and “business” metadata.
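As a hint of what such integration might look like, the sketch below attaches an assumed energy cost to each lineage hop and aggregates it for one report; every name and figure is invented.

```python
# Sketch of combining technical lineage with operational metadata:
# annotate each hop with an assumed energy cost, then sum the unique
# hops that contribute to a given report. All values are invented.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("src.trades", "dwh.positions", kwh=1.2)  # nightly load
lineage.add_edge("dwh.positions", "mart.risk", kwh=0.8)   # aggregation
lineage.add_edge("mart.risk", "report.ecb", kwh=0.1)      # rendering

hops = {
    (u, v)
    for path in nx.all_simple_paths(lineage, "src.trades", "report.ecb")
    for u, v in zip(path, path[1:])
}
total = sum(lineage.edges[u, v]["kwh"] for u, v in hops)
print(f"Estimated energy per report build: {total:.1f} kWh")  # 2.1 kWh
```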
The key principle must be to eliminate manual work and maximize machine processing of metadata. In the final presentations we saw, among other things, the use of ready-made machine learning models aimed at better “pruning” the complex tree of dependencies, either horizontally (simplifying hierarchical dependencies of data objects, clustering tables, etc.) or vertically (skipping and filtering out unimportant intermediate steps in data processes).
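For the vertical case, even a simple rule-based baseline illustrates the idea: contract every node that has exactly one input and one output, treating it as an unimportant intermediate step. The models shown at the hackathon naturally used richer criteria; this sketch is just the simplest possible version:

```python
# Sketch of "vertical" pruning of a lineage graph: contract pass-through
# nodes (exactly one predecessor and one successor) so that only the
# endpoints of straight-line processing chains remain.
import networkx as nx

def prune_pass_through(g: nx.DiGraph) -> nx.DiGraph:
    g = g.copy()
    for node in list(g.nodes):
        preds = list(g.predecessors(node))
        succs = list(g.successors(node))
        if len(preds) == 1 and len(succs) == 1 and preds[0] != succs[0]:
            g.add_edge(preds[0], succs[0])  # bridge over the middle step
            g.remove_node(node)
    return g

g = nx.DiGraph([("src", "stage_1"), ("stage_1", "stage_2"), ("stage_2", "report")])
print(list(prune_pass_through(g).edges))  # [('src', 'report')]
```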
The hackathon confirmed that investing in the processing and use of metadata (and by far not only metadata describing technical data lineage) is decidedly worthwhile, and that strategic foresight in this area will bring key competitive advantages to large organizations in the very near future.
Author: Petr Hájek
Information Management Advisor