The introduction of the GDPR has left companies that work with large volumes of personal data treading on eggshells. More than a year later, some companies are still unsure whether they are up to par when it comes to complying with the formidable regulation.
Even when using a modern data management platform, can firms be confident that they are handling personal data correctly? The matter gets more complicated when an organisation’s operations run on personal data. Few industries work with more personal data than the airline industry, which handles passenger flight data, loyalty programme data, purchase data and more. Finnair, Finland’s largest airline, recently encountered a challenge with its data management. The way the company solved it is insightful and can serve as an example for other companies dealing with large quantities of personal data.
Finnair and personal data
Jarkko Venna, Senior Data Engineer at Finnair, sheds light on the Finnish airline’s experience with handling personally identifiable information in a modern data platform.
Finnair deals with about 40,000 passengers a day, and each passenger is associated with a large amount of data. As a result, Finnair’s data sets contain a huge volume of personal data that the company must be able to remove, or apply retention rules to, at any moment, in line with GDPR.
A brand new platform with brand new challenges
For context, at the time of the interview Finnair was developing a new cloud-based data platform. The company was experiencing issues with its then-current platform, such as siloed systems and hard-to-maintain personal data scattered across them. These were the reasons Finnair started working on a new platform that would unite all data sources.
However, while implementing the new platform, Finnair came across some conflicting requirements. On the one hand, there were usage needs. There was an influx of data that needed to be available for analysis faster. Finnair was aiming to introduce data democratisation, which gives everyone access to data so that they can do modelling and analysis with it. Both the number of data sources and the number of people using the data had grown.
On the other hand, there were privacy and security requirements: GDPR and the “right to be forgotten”. Finnair also had to keep data retention in mind, meaning that a company can only store personal information for as long as it has a valid reason to do so. On top of that, privacy and security dictate that an organisation cannot use production personal data in development, even though it’s not possible to do meaningful data development on development data. “You need to have production data in order to get anything out of the process,” Jarkko explains.
Finnair needed to solve these challenges to move forward with the new data platform. According to Jarkko’s presentation, Finnair had to find a way to:
- Keep storing large amounts of data
- Control access to personal data
- Have the data easily available and usable
- Add new data easily
- Remove the personal data when requested
- Remove personal data after a person’s data retention period has expired.
And at the same time, they should be able to
- Keep the removal process manageable, and
- Not cause too many problems for data usability.
So, how did Finnair solve the problem of handling huge volumes of personal data?
The personal data handling strategy
“I’ve been thinking about this for a couple of years now and I came up with only one feasible strategy when you start having a lot of data”, says Jarkko. The solution that Jarkko came up with is handling personal data on ingestion.
Put simply, every time a company ingests data, it replaces each personal data value with a token and stores the token together with the actual value in a mapping table. As you’d expect, there are positive and negative sides to this strategy.
The positive aspects are:
- The tokenised data is stored in one place, while the real personal values are stored in another. Keeping them separate reduces the risk of personal data leaking.
- Most of the data, apart from personal data, is available to everyone in the organisation, without unauthorised employees getting access to sensitive customer data such as names, emails and phone numbers.
- Retention or removal of data is easy to carry out because the personal data is localised in one place rather than scattered across various systems.
- Source systems that keep sending already removed values can be filtered out, since the processing is done on the ingestion side.
But at the same time, some of the negative aspects of this strategy are:
- The ingestion flow gets more complex because personal data needs to be handled. It doesn’t work like a typical data lake where you store the data and forget about it until you need it. The data should be processed as it’s ingested.
- Data content needs to be known before ingestion.
- When the real values are needed, they have to be joined back to the data through the mapping table.
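To illustrate the last point, the sketch below joins real values back onto tokenised rows. The names used here (`detokenise`, the `email` field, the `tok_a1` key) and the dictionary standing in for the mapping table are hypothetical and chosen only for the example. In practice this would more likely be a join in the data platform’s query engine, available only to users authorised to read the mapping table.

```python
def detokenise(rows, mapping, fields):
    """Replace tokens in `fields` with real values looked up in `mapping`.

    Tokens missing from the mapping (e.g. removed values) are left untouched.
    """
    result = []
    for row in rows:
        restored = dict(row)  # copy, so the tokenised data set is not modified
        for field in fields:
            token = row[field]
            restored[field] = mapping.get(token, token)
        result.append(restored)
    return result


# Hypothetical tokenised data set and mapping table.
bookings = [
    {"booking": "B1", "email": "tok_a1"},
    {"booking": "B2", "email": "removed"},
]
mapping_table = {"tok_a1": "anna@example.com"}

restored = detokenise(bookings, mapping_table, ["email"])
```

The extra lookup is the cost of the strategy: anyone who does not need real values can work on `bookings` directly and never touches the mapping table.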
The data ingestion process in the background
The algorithm for this kind of data ingestion starts with reading the data. For each personal data field, Finnair identifies the customer or employee ID that the personal data item belongs to. This step is important for the later stage, where the ID is needed to remove the value.
If the person ID is already in the mapping table with a removed status, the field value is set to “removed”. The real value is not passed forward because it’s removed and it shouldn’t appear in the system.
If the value has not been removed, there are four steps to follow:
- Calculate the key for the mapping table based on the value, the customer ID and other fields
- Replace the field value with the key
- Write the real value to the mapping table with the key and person’s ID
- Output the actual data with the value replaced with the key.
At the end, instead of one data set, there are two: one containing the mapping from keys to real values, and the other containing the actual data with personal values replaced by keys.
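The ingestion steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Finnair’s implementation: the function names `make_key` and `ingest`, the HMAC-SHA-256 digest used as the key, the `removed_ids` set standing in for the “removed” status lookup, and the in-memory dictionary standing in for the mapping table are all made up for the example.

```python
import hashlib
import hmac

SECRET = b"example-secret"  # assumption: a secret so keys cannot be guessed from values
REMOVED = "removed"         # placeholder written instead of a removed value


def make_key(value, person_id, field):
    """Derive a deterministic key from the value, the person ID and the field name."""
    message = "|".join((person_id, field, value)).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def ingest(record, personal_fields, id_field, mapping, removed_ids):
    """Return a copy of `record` with personal values replaced by keys.

    `mapping` (key -> row) stands in for the mapping table and is updated in place.
    """
    person_id = record[id_field]
    output = dict(record)
    for field in personal_fields:
        if person_id in removed_ids:
            # The person has been removed: never pass the real value forward.
            output[field] = REMOVED
            continue
        key = make_key(str(record[field]), person_id, field)
        # Write the real value to the mapping table with the key and person ID.
        mapping[key] = {"person_id": person_id, "value_type": field,
                        "value": record[field]}
        output[field] = key
    return output


mapping_table = {}
row = {"customer_id": "C1", "email": "anna@example.com", "seat": "12A"}
tokenised = ingest(row, ["email"], "customer_id", mapping_table, removed_ids=set())
forgotten = ingest(row, ["email"], "customer_id", mapping_table, removed_ids={"C1"})
```

Here `tokenised` carries a key in place of the email address while `seat` is untouched, and `forgotten` shows what happens when a source keeps sending data for a person who has already been removed: the value becomes `"removed"` and nothing new is written to the mapping table.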
Requirements for the mapping table
The mapping table has to fulfil certain requirements. Jarkko advises storing enough additional data and metadata for the removal process to be carried out. First, the mapping table should contain the customer ID, because that is the key used in the removal process. The type of key should also be stored, because the same values can arrive from multiple data sources and the key type is needed to tell the rows apart. The value type should be stored as well, which enables enhancements, optimisation and querying of the mapping table. The last requirement Jarkko mentioned is metadata, such as when the data was stored in the mapping table, when it was removed and the reason for the removal.
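One way to picture these requirements is as a row type for the mapping table, together with a removal routine that uses the person ID. The sketch below is an assumption-laden illustration: the class `MappingRow`, its field names, the interpretation of “key type” as a source label, and the `remove_person` helper are hypothetical and do not come from Finnair.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MappingRow:
    key: str                 # token that replaces the value in the data sets
    key_type: str            # tells rows apart when sources share the same values
    person_id: str           # customer or employee ID used by the removal process
    value_type: str          # e.g. "email" or "name"; helps querying and optimisation
    value: Optional[str]     # the real personal value, None once removed
    stored_at: datetime      # metadata: when the row was written
    removed_at: Optional[datetime] = None
    removal_reason: Optional[str] = None


def remove_person(rows, person_id, reason):
    """Blank out every real value for `person_id` and record when and why."""
    now = datetime.now(timezone.utc)
    count = 0
    for row in rows:
        if row.person_id == person_id and row.removed_at is None:
            row.value = None
            row.removed_at = now
            row.removal_reason = reason
            count += 1
    return count


table = [
    MappingRow("k1", "crm", "C1", "email", "anna@example.com",
               datetime.now(timezone.utc)),
    MappingRow("k2", "crm", "C2", "email", "ben@example.com",
               datetime.now(timezone.utc)),
]
removed = remove_person(table, "C1", "erasure request")
```

Because the real values live only in this table, removing a person touches one place; the keys already written to other data sets simply stop resolving to anything.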
Some practical tips for implementation
In theory, the process looks nice and easy, but in practice, things might get a bit messy, especially when the mapping tables start growing. So here are several practical tips that will come in handy in the process:
- It’s a good idea to create a parametrised process. You can code it once, and parametrise it to be used multiple times.
- Make things easy for yourself and your team by standardising the incoming personal data items before storing them in the mapping table, for example lower-case emails and upper-case names.
- For removed values, store a placeholder of the same type as the actual value. For example, use “removed” for strings, a fixed number (e.g. “1”) for numbers and a fixed timestamp (e.g. “1800-01-01 00:00:00”) for timestamps and dates.
If you want to go a step further, Jarkko shares his ideas about creating multiple mapping tables. For privacy and security reasons, he suggests creating different versions of the same mapping:
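These tips can be combined in a small, parametrised helper. The following is only a sketch: the rule tables `STANDARDISERS` and `PLACEHOLDERS` and the two functions are hypothetical names, and the rules themselves simply mirror the examples given in the list above.

```python
from datetime import datetime

# Assumed standardisation rules per value type (lower-case emails, upper-case names).
STANDARDISERS = {
    "email": lambda v: v.strip().lower(),
    "name": lambda v: v.strip().upper(),
}

# Placeholders for removed values, matching the type of the original value.
PLACEHOLDERS = {
    str: "removed",
    int: 1,
    datetime: datetime(1800, 1, 1, 0, 0, 0),
}


def standardise(value_type, value):
    """Apply the standardisation rule for `value_type`, if one is defined."""
    rule = STANDARDISERS.get(value_type)
    return rule(value) if rule else value


def placeholder_for(value):
    """Return a removed-value placeholder of the same type as `value`."""
    return PLACEHOLDERS[type(value)]
```

Adding a new kind of personal data then means adding an entry to a table rather than writing a new process, and downstream code that expects a number or a timestamp in a column keeps working after a removal.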
- Standard – this is the mapping table used in normal situations
- Development – where the real values are replaced by random values. With this version of the mapping table, companies are able to do any kind of development that requires mapping in the development environment without using personal values in it.
- Legal – a version of the mapping table that stores customer information for legal purposes. This situation arises when a customer asks for their data to be removed, but there are legal reasons why the company still needs to be able to identify the customer. Access to the legal mapping table is by default much more restricted than access to the standard one.
Mapping tables can also be split by data area, because employees in different departments need different access rights to different kinds of personal data. Some examples of the mapping tables that can be created are:
- Mapping table with customer data – for people who work with customer data
- Mapping table with HR data – for people who work with employee data.
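As an illustration of the development version, the sketch below builds a mapping with the same keys as the standard one but with random values of a similar shape. The function names and the shape-preserving approach (same length, letters replaced by letters, digits by digits) are assumptions made for this example; the source only says that real values are replaced by random ones.

```python
import random
import string


def random_like(value, rng):
    """Return a random string with the same length and character classes as `value`."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators such as "@" and "."
    return "".join(out)


def development_mapping(standard, seed=0):
    """Build a development mapping: same keys as `standard`, random values."""
    rng = random.Random(seed)
    return {key: random_like(value, rng) for key, value in standard.items()}


standard_mapping = {"k1": "anna@example.com", "k2": "+358401234567"}
dev_mapping = development_mapping(standard_mapping)
```

Since the keys are unchanged, joins written against the development mapping behave the same way against the standard one, while the development environment never holds a real value.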
The key takeaways
Finnair’s solution for a modern data management platform is a smart and practical example of how companies in any industry that deal with large volumes of personal data can make sure they handle it correctly. It enables wide and granular access to data without compromising personal data security, and it simplifies the data removal process.