Data health vs. Data quality

28. 4. 2022

Do you think it is possible to have a healthy enterprise data environment despite the consumption and storage of poor-quality data?

Caleb Scharf, in his book The Ascent of Information, challenges readers to think about the term “data health”. Let me pick up the gauntlet he has thrown down and elaborate a bit on this topic.

Intuitively, I have never been a fan of data quality (DQ) projects. Of course, I understand that data quality is crucial to business success. Yes, I have been certified in data quality management and I know how to define, measure and improve data quality in large companies. However, whenever I hear or read about DQ projects, they strike me as contrived and quite distant from our real working lives. Moreover, we talk about data quality much less frequently today than we did several years ago. Have we given up or just changed our data-cleaning strategy?

This leads to the initial question in the sub-title. Let’s think about it using the following analogy: Can you live a (relatively) healthy life despite eating low-quality food with fat and sugar; drinking alcohol or even smoking cigarettes; breathing polluted air in cities; drinking water with microplastics, dissolved hormones, antibiotics, etc.; and doing many other things we generally consider unhealthy?

First, I will try to define what health is. We know that the health of any living system (be it a cell, organism, society or enterprise) is connected to its ability to keep itself balanced around a certain equilibrium. (This is called homeostasis* and is considered to be vital to life.) But are all organisms always healthy if they just maintain stability by oscillating and balancing at a specific point?

Using your intuition, you can easily distinguish between a healthy tree standing and growing tall in the forest and a sick, withered, twisted tree crouching nearby. Both are alive, both grow, and even the second one may still be able to spread its seed to reproduce and propagate its genes. But you would never call the second one a healthy tree, although both may continue to survive.

I believe that healthiness is about more than just surviving. Perhaps, it is even about more than just growing. It is about spreading, blooming. (And, ultimately, if the system is not only alive but also sentient, it is about feeling the joy of life.)

If you want to be healthy, there are different strategies to choose from. One is to avoid every risk: you can stay locked up at home, always wash your hands twice, disinfect the entire house, make sure you never touch any alcohol, cigarettes, sugar or other so-called drugs, always go to sleep early, etc. But you will soon find that life outside your bubble is a bit different and that you are not keeping up with the others. Moreover, spending so much of your energy maintaining the desired levels of “purity” may leave you frustrated, so that, at the end of the day, you do not feel that healthy anyway. Sometimes you may feel stressed or anxious because you broke your own rules; for example, you went to a party and drank “poisonous” alcohol. You are surviving rather than blooming.

Another strategy is adaptation: you expose yourself to real life and balance your inner strengths to build immunity, and you even thrive on everything that comes from outside. Yes, the risk is that you will not manage this strategy successfully and will lose your balance and health. But if you succeed, your newly established dynamic balance will allow you to feel healthy, keep up with your strongest competitors and fully enjoy what you do.

I like the opinion of a local holistic medicine practitioner who says that the balance of your health lies around what he calls the Pareto optimum, that is, the 80:20 ratio known from the Pareto principle. In other words, around 20% of what you expose your body and mind to should be stressors (while the remaining 80% should be healing or healthy factors). If you allow more stressors, you will develop an illness sooner or later. If you allow fewer, over the long term you will not be healthy either: you will be more stressed by your purist habits and lose your ability to keep up with others.

Now, let’s go back to data. In the past, there were many supporters of maximum (or even total) data quality. Data purists. At first sight, they (or, to be honest, we) were right. It is always better to deal with clean, error-free data inputs. You save a lot of money, time and energy if you allow only clean data in. You do not need to invest in the “immune system” of your data environment, you do not need to deal with operational incidents caused by loading bad data, you do not need to run “detox” (data cleansing) projects afterwards, etc. But this approach has never worked in practice. We have always had to deal with poor-quality data. We have been forced to process such data instead of halting our data pumps and saying: “If you provide data of the desired quality, we will be glad to process it; otherwise, we will stop working.”

The end users of our systems have always pushed us to get our hands dirty with the source data we had. So, we were forced to change our approach if we wanted to “survive” on the market. But in many cases, the pendulum has swung from one extreme to the other. Instead of continuing with the push for maximum data quality, we have allowed floods of poor data into our systems. As a result of this infectious, poor-quality data, our systems have become more and more complex and have shifted to an unhealthy state accompanied by cost inefficiencies, performance issues and sometimes even near-unsustainability.

Now, many organizations need to make difficult decisions on how to modernize their data processing environments. It is obvious that they cannot go back to the principles of data quality purism. They need to find a way to stay healthy, even in turbulent external and internal environments that produce huge volumes of unstructured, incomplete, ambiguous, vague and far-from-ideal data.

Here are a few tips:

1) Metadata

Make your data environment as transparent as you can. That means preparing and using machine-readable metadata that defines and describes your data environments down to the level of atomic detail, integrating this metadata, making it available to everyone who works with data in your company, and visualizing it as if you were drawing maps.

2) Pareto optimum

Allow low-quality data inputs and data architecture exceptions as much as you have to, but no more. For some mysterious reason, you will always end up with a magic equilibrium close to the Pareto optimum, where clean vs. dirty will be around 80:20.

3) Train your systems to be more fit

Build a complex “immune system” in your data processing environment so that you can move forward even in tough conditions. Be prepared for even worse data quality, dynamic changes, complicated business rules and vague business requirements. Let it grow and bloom.
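To make tips 2 and 3 a little more concrete, here is a minimal Python sketch of one possible "immune reaction": instead of halting the data pump when bad records arrive, the loader sets them aside in a quarantine and reports the share of clean data. Everything in it is an assumption made for illustration only: the names `LoadReport`, `load_with_quarantine` and `RULES`, the two validation rules, and the sample records are hypothetical and do not come from any particular product or library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

Record = dict
Rule = Callable[[Record], bool]


@dataclass
class LoadReport:
    """Outcome of one load: records that passed and records set aside."""
    clean: List[Record] = field(default_factory=list)
    # Each quarantined entry keeps the record and the names of the rules it failed.
    quarantined: List[Tuple[Record, List[str]]] = field(default_factory=list)

    @property
    def clean_share(self) -> float:
        """Fraction of records that passed every rule (1.0 for an empty load)."""
        total = len(self.clean) + len(self.quarantined)
        return len(self.clean) / total if total else 1.0


def load_with_quarantine(records: Iterable[Record], rules: Dict[str, Rule]) -> LoadReport:
    """Keep processing when bad data arrives: route failures to quarantine."""
    report = LoadReport()
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            report.quarantined.append((record, failed))
        else:
            report.clean.append(record)
    return report


# Hypothetical validation rules, chosen only for this example.
RULES: Dict[str, Rule] = {
    "has_customer_id": lambda r: bool(r.get("customer_id")),
    "amount_is_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}

# Hypothetical sample input: three clean records and two dirty ones.
records = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": "C2", "amount": 25.5},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "C4", "amount": -3.0},
    {"customer_id": "C5", "amount": 7.0},
]

report = load_with_quarantine(records, RULES)
print(f"clean share: {report.clean_share:.0%}")  # 60% for this sample
for record, failed in report.quarantined:
    print("quarantined:", record, "failed:", failed)
```

Tracking `clean_share` over time is one simple way to see how far a given feed sits from the 80:20 balance described above, and the list of failed rule names shows where the "infection" comes from.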

If you are healthy, you are also strong enough to thrive in a “dirty environment” full of infections and toxins. You keep your body and mind in a stable state around an equilibrium, which allows you to go outside even if it is cold and others are coughing and sneezing. The same is true of the data environment in your company.

Now, for a crucial question: How can I check whether my data is healthy? Simple answer: The ultimate indicator is whether your people enjoy working with the data.

_______

* Homeostasis is any self-regulating process by which an organism tends to maintain stability while adjusting to conditions that are best for its survival. If homeostasis succeeds, life continues; if it fails, the result is a disaster or the death of the organism.

Author: Petr Hájek
Information Management Advisor