Personal Data Processing: What Data Scientists Don’t Want to Hear

February 3, 2020

Until just a few years ago, personal data protection was grossly undervalued. Anyone who was a bit protective of their data and wanted, for example, to read the general terms and conditions and the rules for personal data processing before signing a contract was considered strange. Anyone who paid attention to mobile app permissions or used a dedicated app for encrypted communication was suspected of doing something illegal or of being simply crazy. The prevailing opinion could be summed up in one question: “Do you have something to hide?” I’ll never forget being at Machine Learning Prague 2018 when it was announced that Facebook had developed an internal divorce predictor. The inputs were messages, phone locations, and calls. Success rate? Over 90%.

This situation prompted change in the European Union and led to the regulation whose abbreviation almost everyone knows: GDPR. Unfortunately, few people have read it, so they often don’t know that most of the obligations it sets out were already part of the earlier Czech Personal Data Protection Act, No. 101/2000 Coll. The key clause that made the ice start to melt, even among the toughest corporations, was the one on the size of fines. Businesses are also now required to give you an inventory of your personal data free of charge.

The Wild West days when data scientists told customers “I have your data and I can do whatever I want with it” are over. To my great surprise, Bernard Marr, a star in the field of data processing and monetization, directly recommends in his book Data Strategy keeping records of the source and licensing of all data. So the proposition “It’s published online, we’ve downloaded it, and now we can use it” no longer holds. Unfortunately, people still thoughtlessly consent to nearly anyone processing their data.
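To make the idea of recording sources and licensing concrete, here is a minimal sketch of what one entry in such a register might look like. It is only an illustration: the `DatasetRecord` class, its field names, the `usage_problems` helper, and the example URL are hypothetical and are not taken from Marr's book or from any regulation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """One entry in a hypothetical register of data sources."""
    name: str
    source: str                        # where the data came from
    license: str                       # terms under which it may be used
    acquired_on: date
    contains_personal_data: bool
    legal_basis: Optional[str] = None  # e.g. "consent" or "contract"


def usage_problems(record: DatasetRecord) -> List[str]:
    """Return the reasons why the dataset should not be used yet."""
    problems = []
    if not record.source.strip():
        problems.append("source is not documented")
    if not record.license.strip():
        problems.append("license is not documented")
    if record.contains_personal_data and not record.legal_basis:
        problems.append("personal data without a documented legal basis")
    return problems


# "It's published online, we've downloaded it" (made-up example entry):
scraped = DatasetRecord(
    name="forum_posts",
    source="https://example.com/forum",
    license="",
    acquired_on=date(2020, 1, 15),
    contains_personal_data=True,
)

for problem in usage_problems(scraped):
    print(problem)
```

With an entry like this, the downloaded dataset is flagged twice: once for the missing license and once for the missing legal basis. A real register would need more fields, but even this much forces the question of whether the data may be used at all.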

Here are a few examples of mass data processing that were recently brought to our attention by the media.

The first involves Avast, a company that provides some of the world’s best antivirus software. An article in Forbes (only in Czech) described how the company collects data on its clients’ online movements, including information such as what they bought in an e-shop. Avast felt that the article misrepresented it and responded with an article on Lupa.cz (only in Czech) clarifying that it does not use personal data, only depersonalized aggregates. The company also referred to other articles that look closely at the situation. With this information, Avast could probably categorize users according to whether they shop at AliExpress, Košík, or Rohlík, or by how much people shop in certain parts of a city, for example. Users could also be categorized by which political party they are gravitating toward and whom they will probably vote for. Such data could be valuable to many people, and its misuse cannot be ruled out. It is also definitely worth reading the Deník N interview (only in Czech) with Professor Pěchouček, CTO at Avast, where he says “I want even phlegmatics to be paid for their online data.”

The second example of personal data processing is the so-called TelcoScore (article only in Czech). The principles behind the service can be pieced together from publicly available sources.

The TelcoScore is a service provided by the Society for Information Databases (SID). First, they collect information from mobile operators for three months. Then, they use the data they have collected to assign individuals a TelcoScore from 1 to 1,000, with 1 being the worst and 1,000 being the best. Anyone can ask SID for your score to assess your creditworthiness.

In 2018, Jan Cibulka, a reporter for Czech Radio, tried to find out exactly what data TelcoScore uses. The article cites the law in force that entitles people to detailed information on how their data is processed, yet he received no such information.

As a result, the Office for Personal Data Protection (OPDP) carried out several audits. In one of them (article only in Czech), the OPDP states that the telecommunications company failed to inform the complainant of the principles followed when processing his data.

The third example is the product Clearview AI. In recent years, the company Clearview has downloaded billions of freely available photographs of Internet users, computed biometric data from them, and now offers its customers the ability to identify people in photographs almost immediately. Think about how many photos are publicly available online in which your face can be matched with a name. The product was developed primarily in the USA, where it has significantly accelerated the identification of some criminals. A discussion immediately broke out on whether to outlaw such techniques in the Czech Republic. This is a very different approach to the topic from that of a nation such as China. Even recent moves by the Prague police (article only in Czech) have triggered discussion.

Finally, we have to mention the social networks Facebook, Twitter, Instagram, LinkedIn, and Vkontakte, and of course Google as well. They hold enormous amounts of data on things such as pages visited, friends, colleagues, and purchases, and they use it in ways that are not transparent to us. Fortunately, many of them let you export your data (a “takeout”). You can download everything and think about whether you are sharing more about yourself than you should. A quick test is to ask yourself whether you would like this data to be available to the general public.

January 28 is Personal Data Protection Day. As we have already mentioned, there are more laws than just GDPR that give you the right to information about how your data is processed. There are forms specifically made for requesting this information, including those at https://github.com/good-lly/gdpr-documents/tree/master/docs/cz_%C4%8Desky.

Celebrate Personal Data Protection Day by valuing your personal data. Ask those who may be processing your personal data what purposes they are using it for.

Author: Marek Sušický

Head of Big Data