What Is Data Masking and How Do We Do It?
by Michiel Mulders. Modified by Eric Goebelbecker.
With the cost of data breaches increasing every year, there’s a need for higher security standards. According to IBM’s 2021 security report, the average total cost of a data breach has risen to $4.24 million per breach. It’s no wonder we need more advanced techniques to protect sensitive data. One of those techniques is data masking. In this post, I’ll talk you through the pros and cons of data masking and show you some techniques you can use to mask data. Let’s start with a detailed introduction to data masking.
What Is Data Masking?
Enterprises use data masking or data obfuscation to identify and hide sensitive data. This sensitive data can vary from personal data to intellectual property. There are several ways of data masking, but the purpose is to ensure the data is safe. A common example is a credit card number that has been scrambled or blurred.
Maybe you’ve already come across different types of data masking, such as static or dynamic masking. Static masking is what we call masking data in place. Dynamic data masking is when the masking happens at request time, based on the requestor’s identity.
Why Is Data Masking Important?
As the amount of data we need increases, so does the risk of a data breach. It’s impossible to create foolproof protection for every copy of your data. This is especially true when not everybody with access has the technological literacy we’d hope. But an enterprise can neutralize several factors that make data breaches so expensive. A good rule of thumb is that no active database should contain unmasked data.
The goal of masking is to protect the data from abuse while still providing developers with data for testing. This reduces the impact of data breaches and improves data security. Data masking achieves this because the masked values no longer link to any actual sensitive customer information.
Who Uses Data Masking?
Every enterprise that handles personal or private information should consider masking data. This applies especially to larger organizations, which often deal with large amounts of data. These companies should handle their data with extreme care.
In addition to this, some companies have to deal with GDPR. Let’s take a closer look at what that entails.
GDPR and Data Masking
The 2018 General Data Protection Regulation (GDPR) imposes severe penalties for mishandling data. So, many businesses are looking for effective ways of protecting the personal data they hold.
The most common types of information enterprises want to mask include:
- Personally identifiable information (PII) like name, address, sex, etc.
- Protected health records
- Transaction history and card or payment information
- Intellectual property
All these data types need to be handled with care, and many of them count as personal data under GDPR. Companies dealing with these types of data should look for techniques to protect their sensitive data.
Next, let’s dive further into the need for data masking.
Why Do We Need Data Masking?
We have already learned about the EU’s GDPR requirements introduced in 2018. However, there are many more reasons to mask your data.
- Protect business data from third-party vendors. Often, businesses have to share data with third parties like suppliers, marketing teams, or consultants. When handing over this data, the business loses control over data that should remain confidential. Therefore, data masking can be applied before sharing, so that vendors receive only masked values and never see the confidential originals.
- Safeguard against human error. Human error often lies at the source of a data leak. For example, an operator might turn off the firewall for a database. A small mistake like this one can cause a data leak. A business can safeguard itself against such data leaks by masking its data. Then, if an attacker gets unauthorized access to the database, the data is obfuscated and useless to them.
- Not all operations require real data. For example, testing an application can be easily executed with randomly generated data. I recommend using randomly generated data during testing, as an application still in the testing phase might leak sensitive data.
Data Masking Techniques
There are many data masking techniques. However, start by assessing what data you’re using and if you’re returning the minimal required data.
For example, say you need only the address and age of a certain user in your application. When you query for the user, you could have the app return all underlying data. But from a security standpoint, it’s better to return only the necessary information and reduce the risk of leaking information. Therefore, we can apply a view that returns only the address and age of this user.
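As a minimal sketch of such a view, the snippet below uses Python’s built-in sqlite3 module with an in-memory database. The users table, its columns, and the view name user_address_age are assumptions made up for illustration. The view also keeps the id column so the application can still look up a specific user.

```python
import sqlite3

# Hypothetical schema for illustration; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, address TEXT, age INTEGER)"
)
conn.execute(
    "INSERT INTO users (name, email, address, age) VALUES (?, ?, ?, ?)",
    ("Jane Doe", "jane@example.com", "1 Main Street", 34),
)

# The view exposes only the columns the application actually needs.
conn.execute("CREATE VIEW user_address_age AS SELECT id, address, age FROM users")

# The application queries the view instead of the underlying table.
cursor = conn.execute("SELECT * FROM user_address_age WHERE id = ?", (1,))
columns = [description[0] for description in cursor.description]
row = cursor.fetchone()
print(columns, row)
```

Because the name and email never leave the database through this view, a bug or leak in the application layer has less to expose.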
Now that we know to limit the data we return to only the data we need, let’s explore five different data masking techniques.
1. Substitution
The substitution technique refers to substituting data with similar values. Substitution is an effective technique to replace production data with realistic data.
For example, let’s say we’re returning a user object with a name and address. To mask the data, we can substitute the user’s real name with a fake (but realistic-looking) name. This ensures the combination of name and address cannot identify a person.
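The sketch below shows one possible way to do this in Python. The pool of fake names, the secret value, and the function names are all hypothetical. Picking the replacement from a keyed hash of the real name is my own assumption: it makes the same real name always map to the same fake name, which keeps test data consistent across runs. A real setup would use a much larger pool of names.

```python
import hashlib

# Hypothetical pool of realistic-looking replacement names.
FAKE_NAMES = ["Alex Morgan", "Sam Taylor", "Jordan Lee", "Casey Brown", "Riley Clark"]


def substitute_name(real_name: str, secret: str = "demo-secret") -> str:
    """Pick a fake name deterministically from a keyed hash of the real name."""
    digest = hashlib.sha256((secret + real_name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


def mask_user(user: dict) -> dict:
    """Return a copy of the user with the name substituted; other fields stay."""
    masked = dict(user)
    masked["name"] = substitute_name(user["name"])
    return masked


user = {"name": "Jane Doe", "address": "1 Main Street"}
masked_user = mask_user(user)
print(masked_user)
```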
2. Shuffling
Next, data shuffling refers to mixing the values of a column across records. However, we want to ensure we retain logical relationships between data columns in the database. Shuffling is a more advanced technique that masks data while ensuring we keep real relationships between the data.
Let’s say we have customer objects in the database linked with purchases. We want to retain this link between the tables for customers and their purchases. So, we swap the first and last name of each customer with those of other customers in the table and leave the keys that link customers to purchases untouched.
Again, this technique allows you to safely use production data in a test environment.
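Here is a rough Python sketch of shuffling, assuming customers are plain dictionaries with hypothetical id, first_name, and last_name fields. It shuffles the two name columns independently, which is one reading of the technique, and never touches the id that purchases refer to.

```python
import random


def shuffle_names(customers, seed=None):
    """Shuffle first and last names across customers; ids stay in place."""
    rng = random.Random(seed)
    first_names = [customer["first_name"] for customer in customers]
    last_names = [customer["last_name"] for customer in customers]
    rng.shuffle(first_names)
    rng.shuffle(last_names)
    return [
        {**customer, "first_name": first, "last_name": last}
        for customer, first, last in zip(customers, first_names, last_names)
    ]


customers = [
    {"id": 1, "first_name": "Jane", "last_name": "Doe"},
    {"id": 2, "first_name": "Omar", "last_name": "Haddad"},
    {"id": 3, "first_name": "Mei", "last_name": "Chen"},
]
shuffled = shuffle_names(customers, seed=7)
print(shuffled)
```

Note that a random shuffle can leave some values in their original position, especially in small tables, so it’s worth checking the result before relying on it.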
3. Blurring
Another data masking technique is blurring, often used for indirect data identifiers like age.
Thousands of people have the same age. But when enough data points are available, a malicious person might be able to figure out which data points belong to which person. This would mean that the unauthorized person can still identify a user.
Therefore, we can use the blurring technique. Blurring anonymizes data points.
For example, let’s say your application uses the age of users. We can apply a numeric blurring function that adds random noise within a specified range to each real age. The age field is then populated with values that still look realistic but no longer match the original exactly.
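A numeric blurring function could look something like the sketch below. The function name, the default offset of three years, and the 18 to 100 bounds are assumptions chosen for illustration.

```python
import random


def blur_age(age, max_offset=3, min_age=18, max_age=100, rng=None):
    """Add random noise of at most max_offset years, clamped to a valid range."""
    rng = rng or random
    noisy_age = age + rng.randint(-max_offset, max_offset)
    return max(min_age, min(max_age, noisy_age))


rng = random.Random(42)  # seeded only to make the demo repeatable
blurred_ages = [blur_age(age, rng=rng) for age in [18, 30, 45, 99]]
print(blurred_ages)
```

A wider offset gives more privacy but makes the data less useful for anything that depends on accurate ages, so the range is a trade-off.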
4. Credit Card Masking
Credit card masking is tricky, as valid credit card numbers contain a checksum. The final digit of a credit card holds this checksum number. Therefore, we must pay attention when masking credit cards with random numbers, as we don’t want validation to fail for our masked data. Many tools can generate new credit card numbers with a valid checksum.
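The checksum on card numbers is the Luhn algorithm. The Python sketch below replaces the digits of a card number with random ones and then recomputes the final check digit so validation still passes. Keeping the first six digits (the issuer prefix) is an assumption on my part and is optional. The function names are hypothetical.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number that lacks its final digit."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled first.
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def luhn_is_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(number: str, rng=None) -> str:
    """Keep the first six digits, randomize the rest, and fix the check digit."""
    rng = rng or random
    prefix = number[:6]
    middle = "".join(str(rng.randint(0, 9)) for _ in range(len(number) - 7))
    partial = prefix + middle
    return partial + luhn_check_digit(partial)


# 4111111111111111 is a widely used test number, not a real card.
masked_card = mask_card_number("4111111111111111", rng=random.Random(0))
print(masked_card, luhn_is_valid(masked_card))
```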
5. Nullification Masking
Finally, nullification masking replaces a column of data with a null value. This technique is only used for hiding highly sensitive data that cannot be mixed or blurred. Applying the nullification technique makes it impossible to discover the original value based on the null value.
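In code, nullification is the simplest of the techniques. The following Python sketch assumes records are dictionaries; the field names are hypothetical.

```python
def nullify(record: dict, sensitive_fields) -> dict:
    """Return a copy of the record with the sensitive fields set to None."""
    return {
        key: (None if key in sensitive_fields else value)
        for key, value in record.items()
    }


patient = {"id": 7, "name": "Jane Doe", "diagnosis": "confidential"}
nulled = nullify(patient, {"name", "diagnosis"})
print(nulled)  # {'id': 7, 'name': None, 'diagnosis': None}
```

The downside is that nulled columns are of little use for testing logic that depends on those values.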
I want to introduce you to dynamic masking as a final masking technique.
Dynamic Data Masking vs. Traditional Data Masking
We use dynamic data masking in real-time environments where data doesn’t leave the production database. This means that we have a higher level of security for our production data.
With dynamic masking, only authorized users can view the original data. However, the application scrambles the data on the spot for unauthorized users. It’s a performant technique for data masking that protects the production database.
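To make the idea concrete, here is a small Python sketch of masking at request time. It is an illustration, not how any particular database product implements dynamic masking. The pii_reader role, the record fields, and the function names are assumptions.

```python
def mask_card(card_number: str) -> str:
    """Hide all but the last four digits of a card number."""
    return "*" * (len(card_number) - 4) + card_number[-4:]


def read_customer(record: dict, requester_roles: set) -> dict:
    """Return the original record to authorized roles, a masked copy otherwise."""
    if "pii_reader" in requester_roles:  # hypothetical role name
        return dict(record)
    masked = dict(record)
    masked["card_number"] = mask_card(record["card_number"])
    return masked


record = {"name": "Jane Doe", "card_number": "4111111111111111"}
support_view = read_customer(record, {"support"})
auditor_view = read_customer(record, {"pii_reader"})
print(support_view)
print(auditor_view)
```

The stored record is never changed. Only the copy handed to the requester differs, depending on who is asking.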
In contrast, traditional data masking doesn’t use such a dynamic layer that can mask the data. With a traditional approach, you copy the production database and apply a data masking technique of your choice to the copy. After you’re done, you can safely use the masked copy for testing in your testing environment.
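The copy-then-mask flow can be sketched with Python’s built-in sqlite3 module. The customers table and its columns are made up for this example, and two in-memory databases stand in for the production and test databases.

```python
import sqlite3

# Stand-in for the production database (hypothetical schema and data).
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)"
)
production.execute(
    "INSERT INTO customers (name, ssn) VALUES (?, ?)", ("Jane Doe", "123-45-6789")
)
production.commit()

# Step 1: copy the production database.
test_copy = sqlite3.connect(":memory:")
production.backup(test_copy)

# Step 2: mask the copy in place (substitute the name, null the SSN).
test_copy.execute("UPDATE customers SET name = 'Customer ' || id, ssn = NULL")
test_copy.commit()

production_row = production.execute("SELECT id, name, ssn FROM customers").fetchone()
masked_row = test_copy.execute("SELECT id, name, ssn FROM customers").fetchone()
print(production_row)  # (1, 'Jane Doe', '123-45-6789')
print(masked_row)      # (1, 'Customer 1', None)
```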
Get Started With Data Masking
Before starting data masking, assess what data you are returning. Always make sure to return the minimum required data.
Many techniques exist for masking data. If you want to use production data in your test environment, first assess the type of data you are handling. Based on that, you can choose the right data masking technique for your needs.
The easiest way to get started with data masking is the substitution technique. It allows you to simply replace real values with realistic fake ones, making it much harder to identify a person or to link records together to restore the original record. The shuffling technique, on the other hand, allows you to retain logical relationships in your database.
If you’re working with highly sensitive data, consider using the nullification method. The nullification technique ensures that no sensitive data is exposed.
This post was originally written by Michiel Mulders. Modified for re-publication by Eric Goebelbecker.
Michiel Mulders Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!
Eric Goebelbecker Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).