What Is Data Masking and How Do We Do It?


MAY, 2022

by Michiel Mulders. Modified by Eric Goebelbecker.

With the cost of data breaches increasing every year, there’s a need for higher security standards. According to IBM’s 2021 Cost of a Data Breach Report, the average total cost of a data breach has risen to $4.24 million. It’s no wonder we need more advanced techniques to protect sensitive data. One of those techniques is data masking. In this post, I’ll talk you through the pros and cons of data masking and show you some techniques you can use to mask data. Let’s start with a detailed introduction to data masking.

What Is Data Masking?

Enterprises use data masking, also called data obfuscation, to identify and hide sensitive data. This sensitive data can vary from personal data to intellectual property. There are several ways to mask data, but the purpose is always the same: to keep the real values safe. A common example is a credit card number that has been scrambled or blurred.

Maybe you’ve already come across different types of data masking, such as static or dynamic masking. Static masking permanently replaces the stored values themselves, typically in a copy of the database. Dynamic data masking is when the masking happens at request time, based on the requestor’s identity.

Why Is Data Masking Important?

As the amount of data we need increases, so does the risk of a data breach. It’s impossible to create foolproof protection for every copy of your data. This is especially true when not everybody with access has the technological literacy we’d hope for. But an enterprise can neutralize several factors that make data breaches so expensive. A good rule of thumb is that no active database should contain unmasked data.

The goal of masking is to protect the data from abuse while still providing developers with data for testing. This reduces the impact of data breaches and improves data security. Data masking achieves this because the masked values no longer link to any actual sensitive customer information.

Who Uses Data Masking?

Every enterprise that handles personal or private information should consider masking data. This applies especially to larger organizations, which often deal with large amounts of data. These companies should handle their data with extreme care.

In addition to this, some companies have to deal with GDPR. Let’s take a closer look at what that entails.


GDPR and Data Masking

The General Data Protection Regulation (GDPR), which took effect in 2018, imposes severe penalties for mishandling personal data. So, many businesses are looking for effective ways of protecting this information.

The most common types of information enterprises want to mask include:

  • Personally identifiable information (PII) like name, address, sex, etc.
  • Protected health records
  • Transaction history and card or payment information
  • Intellectual property

Personal data of this kind is subject to GDPR, and all of these data types need to be handled with care. Companies dealing with them should look for techniques to protect their sensitive data.

Next, let’s dive further into the need for data masking.

Why Do We Need Data Masking?

We have already learned about the EU’s GDPR requirements, which took effect in 2018. However, there are many more reasons to mask your data.

  1. Protect business data from third-party vendors. Often, businesses have to share data with third-party companies like suppliers, marketing teams, or consultants. When handing over this data, the business loses control over data that should remain confidential. Therefore, data masking can be applied when sharing data with third-party vendors to ensure vendors can access only the data they need, never the real confidential values.
  2. Safeguard against human error. Human error often lies at the source of a data leak. For example, an operator might turn off the firewall for a database. A small mistake like this one can cause a data leak. A business can safeguard itself against such data leaks by masking its data. Then, if an attacker gets unauthorized access to the database, the data is obfuscated and useless to them.
  3. Not all operations require real data. For example, you can easily test an application with randomly generated data. I recommend using randomly generated data during testing, as an application still in the testing phase might leak sensitive data.

Data Masking Techniques

There are many data masking techniques. However, start by assessing what data you’re using and whether you’re returning only the minimum required data.

Suppose you need only the address and age of a certain user in your application. When you query for the user, you could have the app return all underlying data. But from a security standpoint, it’s better to return only the necessary information and reduce the risk of leaking information. Therefore, we can apply a view that returns only the address and age of this user.
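As a minimal sketch of that idea, the snippet below uses Python’s built-in sqlite3 module. The table name `users`, the view name `user_address_age`, and the sample row are assumptions made up for illustration, not part of any real schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table and view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, address TEXT, age INTEGER)"
    )
    conn.execute(
        "INSERT INTO users (name, email, address, age) VALUES (?, ?, ?, ?)",
        ("Ada Example", "ada@example.com", "1 Sample Street", 36),
    )
    # The view exposes only the columns the application actually needs.
    conn.execute(
        "CREATE VIEW user_address_age AS SELECT id, address, age FROM users"
    )
    return conn


conn = build_demo_db()
cursor = conn.execute("SELECT * FROM user_address_age WHERE id = ?", (1,))
columns = [col[0] for col in cursor.description]
row = cursor.fetchone()
print(columns, row)
```

Code that queries the view never sees the name or email columns, so a bug in that code cannot leak them.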

Now that we know to return only the data we need, let’s explore five different data masking techniques.

1. Substitution

The substitution technique refers to substituting data with similar values. Substitution is an effective technique to replace production data with realistic data.

For example, let’s say we’re returning a user object with a name and address. To mask the data, we can substitute the user’s real name with a fake (but realistic-looking) name. This ensures the combination of name and address cannot identify a person.
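Here is one possible sketch of substitution in Python. The pool of fake names, the `secret` value, and the helper names `substitute_name` and `mask_user` are all hypothetical. Hashing the real name is just one way to make the same input always map to the same fake name; a real tool would likely use a much larger pool, since a small one maps many people to the same substitute.

```python
import hashlib

# Hypothetical pool of realistic-looking replacement names.
FAKE_NAMES = ["Alex Morgan", "Sam Rivera", "Jamie Chen", "Taylor Brooks", "Robin Patel"]


def substitute_name(real_name: str, secret: str = "demo-secret") -> str:
    """Pick a fake name deterministically, so repeated runs stay consistent."""
    digest = hashlib.sha256((secret + real_name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


def mask_user(user: dict) -> dict:
    """Return a copy of the user with the name substituted; other fields are kept."""
    masked = dict(user)
    masked["name"] = substitute_name(user["name"])
    return masked


user = {"name": "Grace Hopper", "address": "42 Navy Road"}
print(mask_user(user))
```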

2. Shuffling

Data shuffling refers to mixing the values of a column among the records of a table. However, we want to ensure we retain logical relationships between data columns in the database. Shuffling is a more advanced technique that masks data while keeping the relationships between the data intact.

Let’s say we have customer objects in the database linked with purchases. We want to retain this link between the tables for customers and their purchases. So, we swap the first and last name of each customer with those of another customer in the same table, leaving the keys that link customers to purchases untouched.

Again, this technique allows you to safely use production data in a test environment.
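The shuffling idea could be sketched like this in Python. The field names (`customer_id`, `first_name`, `last_name`) are assumptions for illustration. Note that a plain random shuffle may occasionally leave a name on its original row, so a production tool would need extra checks.

```python
import random


def shuffle_names(customers: list, rng: random.Random) -> list:
    """Shuffle (first_name, last_name) pairs across rows, keeping customer_id fixed."""
    names = [(c["first_name"], c["last_name"]) for c in customers]
    rng.shuffle(names)
    # Rebuild each row with its original key but someone else's name.
    return [
        {**c, "first_name": first, "last_name": last}
        for c, (first, last) in zip(customers, names)
    ]


customers = [
    {"customer_id": 1, "first_name": "Ann", "last_name": "Lee"},
    {"customer_id": 2, "first_name": "Bob", "last_name": "Ng"},
    {"customer_id": 3, "first_name": "Cid", "last_name": "Roy"},
]
shuffled = shuffle_names(customers, random.Random(7))
print(shuffled)
```

Because `customer_id` never changes, rows in a purchases table that reference it still point at a valid customer.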

3. Blurring

Another data masking technique is blurring, often used for indirect data identifiers like age.

Thousands of people have the same age. But when enough data points are available, a malicious person might be able to figure out which data points belong to which person. This would mean that the unauthorized person can still identify a user.

Therefore, we can use the blurring technique. Blurring anonymizes data points.

For example, let’s say your application uses the age of users. We can apply a numeric blurring function that adds random noise within a specified range, for example plus or minus a few years, so the age field still holds realistic-looking values.
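A numeric blurring function might look like the following sketch. The function name `blur_age`, the default spread of three years, and the 0 to 120 bounds are assumptions chosen for illustration.

```python
import random


def blur_age(age: int, rng: random.Random, spread: int = 3,
             minimum: int = 0, maximum: int = 120) -> int:
    """Add uniform random noise in [-spread, +spread] and clamp to a plausible range."""
    noisy = age + rng.randint(-spread, spread)
    return max(minimum, min(maximum, noisy))


rng = random.Random(2022)
print([blur_age(40, rng) for _ in range(5)])
```

A wider spread hides the original value better but makes the data less useful for anything that depends on accurate ages.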

4. Credit Card Masking

Credit card masking is tricky, as valid credit card numbers contain a checksum. The final digit of the card number is a check digit, computed from the other digits with the Luhn algorithm. Therefore, we must pay attention when masking credit card numbers with random digits, as we don’t want validation to fail for our masked data. Many tools can generate new credit card numbers with a valid checksum.
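As a rough sketch of how such a tool might work, the Python below generates a random number of the same length and appends a check digit computed with the Luhn algorithm, the checksum scheme used for payment card numbers. The function names are hypothetical, and the sketch assumes the input contains digits only. It also does not preserve the issuer prefix, which a real tool may want to do.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number that lacks its final digit."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(partial)):
        digit = int(ch)
        if i % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """A number is valid when its last digit matches the computed check digit."""
    return number[-1] == luhn_check_digit(number[:-1])


def mask_card_number(number: str, rng: random.Random) -> str:
    """Replace a card number with random digits of equal length and a valid checksum."""
    body = "".join(str(rng.randint(0, 9)) for _ in range(len(number) - 1))
    return body + luhn_check_digit(body)


masked = mask_card_number("4111111111111111", random.Random(1))
print(masked, is_luhn_valid(masked))
```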

5. Nullification

Finally, nullification masking replaces a column of data with a null value. This technique is only used for hiding highly sensitive data that cannot be mixed or blurred. Applying the nullification technique makes it impossible to discover the original value based on the null value.
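In code, nullification is the simplest of the five techniques. The sketch below assumes records are plain dictionaries; the function name `nullify` and the field names are made up for illustration.

```python
def nullify(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with every sensitive field set to None."""
    return {
        key: (None if key in sensitive_fields else value)
        for key, value in record.items()
    }


patient = {"id": 17, "name": "Ada Example", "diagnosis": "confidential note"}
print(nullify(patient, {"name", "diagnosis"}))
```

The trade-off is that the masked column carries no information at all, so tests that depend on its contents need another technique.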

I want to introduce you to dynamic masking as a final masking technique.

Dynamic Data Masking vs. Traditional Data Masking

We use dynamic data masking in real-time environments where data doesn’t leave the production database. This means that we have a higher level of security for our production data.

With dynamic masking, only authorized users can view the original data. However, the application scrambles the data on the spot for unauthorized users. It’s a performant technique for data masking that protects the production database.
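To make the idea concrete, here is a minimal Python sketch of masking at request time. The role names, field names, and masking rules are assumptions for illustration; real dynamic masking is usually configured in the database or in a proxy layer rather than written by hand.

```python
# Hypothetical roles that are allowed to see unmasked data.
AUTHORIZED_ROLES = {"admin", "support_lead"}


def mask_email(email: str) -> str:
    """Keep the first character and the domain, and star out the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


def read_customer(record: dict, role: str) -> dict:
    """Return the stored record as-is for authorized roles, masked otherwise."""
    if role in AUTHORIZED_ROLES:
        return dict(record)
    masked = dict(record)
    masked["email"] = mask_email(record["email"])
    masked["card_number"] = "*" * 12 + record["card_number"][-4:]
    return masked


stored = {"email": "jane@example.com", "card_number": "4111111111111111"}
print(read_customer(stored, "admin"))
print(read_customer(stored, "intern"))
```

The stored record is never modified; only the copy handed to an unauthorized caller is masked.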

In contrast, traditional (static) data masking has no dynamic layer that masks data on request. With a traditional approach, you copy the production database and apply a data masking technique to the copy. After you’re done, you can safely use the masked copy for testing in your test environment.

Get Started With Data Masking

Before starting data masking, assess what data you are returning. Always make sure to return the minimum required data.

Many techniques exist for masking data. If you want to use production data in your test environment, first assess the type of data you are handling. Based on that, you can choose the right data masking technique for your needs.


The easiest way to get started with data masking is the substitution technique. It lets you swap real values for realistic replacements, making it much harder to identify a person or to link records together to restore the original. If you also need to retain logical relationships in your database, consider the shuffling technique.

If you’re working with highly sensitive data, consider using the nullification method. The nullification technique ensures that no sensitive data is exposed.

Post Author

This post was originally written by Michiel Mulders. Modified for re-publication by Eric Goebelbecker.

Michiel Mulders Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!

Eric Goebelbecker Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).
