What Is Data Masking and How Do We Do It?

MAY, 2022

by Michiel Mulders.

Modified by Eric Goebelbecker.

Authors

This post was originally written by Michiel Mulders. Modified for re-publication by Eric Goebelbecker.

Michiel Mulders Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!

Eric Goebelbecker Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).

 

With the cost of data breaches increasing every year, there’s a need for higher security standards. According to IBM’s 2021 Cost of a Data Breach Report, the average total cost of a data breach has risen to $4.24 million. It’s no wonder we need more advanced techniques to protect sensitive data. One of those techniques is data masking.

In this post, I’ll talk you through the pros and cons of data masking and show you some techniques you can use to mask data. Let’s start with a detailed introduction to data masking.

What Is Data Masking?

Enterprises use data masking, also called data obfuscation, to identify and hide sensitive data. This sensitive data can vary from personal data to intellectual property. There are several ways to mask data, but the purpose is always to keep the original values safe. A common example is a credit card number that has been scrambled or blurred.

Maybe you’ve already come across different types of data masking, such as static or dynamic masking. Static masking is what we call masking data in place: the stored values themselves are changed. Dynamic data masking is when the masking happens at request time, based on the requestor’s identity, while the stored data stays as it is.

Why Is Data Masking Important?

As the amount of data we need increases, so does the risk of a data breach. It’s impossible to create foolproof protection for every copy of your data. This is especially true when not everybody with access has the technological literacy we’d hope for. But an enterprise can neutralize several factors that make data breaches so expensive. A good rule of thumb is that no active database should contain unmasked data.

The goal of masking is to protect the data from abuse while still providing developers with data for testing. This reduces the impact of data breaches and improves data security. Data masking achieves this because the masked values no longer link to any actual sensitive customer information.

Who Uses Data Masking?

Every enterprise that handles personal or private information should consider masking data. This applies especially to larger organizations, which often deal with large amounts of data and should handle it with extreme care.

In addition to this, some companies have to deal with GDPR. Let’s take a closer look at what that entails.

GDPR and Data Masking

The 2018 General Data Protection Regulation (GDPR) imposes severe penalties for mishandling data. So, many businesses are looking for effective ways of protecting this information.

The most common types of information enterprises want to mask include:

  • Personally identifiable information (PII) like name, address, sex, etc.
  • Protected health records
  • Transaction history and card or payment information
  • Intellectual property

All these data types need to be handled with care, and the ones that relate to an identifiable person are subject to GDPR. Companies dealing with these types of data should look for techniques to protect their sensitive data.

Next, let’s dive further into the need for data masking.

Why Do We Need Data Masking?

We have already learned about the EU’s GDPR requirements introduced in 2018. However, there are many more reasons to mask your data.

  1. Protect business data from third-party vendors. Often, businesses have to share data with third-party companies like suppliers, marketing teams, or consultants. When handing over this data, the business loses control over data that should remain confidential. Therefore, data masking can be applied when sharing data with third-party vendors to ensure the vendors can access only the data they need, with confidential values hidden.
  2. Safeguard against human error. Human error often lies at the source of a data leak. For example, an operator might turn off the firewall for a database. A small mistake like this one can cause a data leak. A business can safeguard itself against such data leaks by masking its data. Then, if an attacker gets unauthorized access to the database, the data is obfuscated and useless to them.
  3. Not all operations require real data. For example, testing an application can be easily executed with randomly generated data. I recommend using randomly generated data during testing, as an application still in the testing phase might leak sensitive data.

Data Masking Techniques

There are many data masking techniques. However, start by assessing what data you’re using and if you’re returning the minimal required data.

For example, suppose your application needs only the address and age of a certain user. When you query for the user, you could have the app return all underlying data. But from a security standpoint, it’s better to return only the necessary information and reduce the risk of leaking information. Therefore, we can apply a view that returns only the address and age of this user.
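As a rough illustration of the view idea, here is a minimal sketch using Python’s built-in sqlite3 module. The table name, column names, view name, and the helper get_address_and_age are all assumptions made up for this example; your schema will differ.

```python
import sqlite3

# Hypothetical schema: table, column, and view names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
    "address TEXT, age INTEGER)"
)
conn.execute(
    "INSERT INTO users (name, email, address, age) VALUES (?, ?, ?, ?)",
    ("Ada Example", "ada@example.com", "1 Sample Street", 36),
)

# The view exposes only the columns the application actually needs.
conn.execute(
    "CREATE VIEW user_address_age AS SELECT id, address, age FROM users"
)


def get_address_and_age(connection, user_id):
    """Return (address, age) for a user, read through the restricted view."""
    return connection.execute(
        "SELECT address, age FROM user_address_age WHERE id = ?", (user_id,)
    ).fetchone()


print(get_address_and_age(conn, 1))
```

Because the application queries the view rather than the table, the name and email columns never reach the calling code in the first place.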

Now that we know to limit the data we return to only the data we need, let’s explore five different data masking techniques.

1. Substitution

The substitution technique refers to substituting data with similar values. Substitution is an effective technique to replace production data with realistic data.

For example, let’s say we’re returning a user object with a name and address. To mask the data, we can substitute the user’s real name with a fake (but realistic-looking) name. This ensures the combination of name and address cannot identify a person.
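One possible way to sketch substitution in Python is shown below. The pool of fake names, the secret value, and the function names are hypothetical. Hashing the real name to pick a substitute is just one design choice, made here so the same input always maps to the same fake name.

```python
import hashlib

# Hypothetical pool of stand-in names; a real project would use a larger list.
FAKE_NAMES = ["Alex Morgan", "Sam Taylor", "Jamie Lee", "Robin Clarke", "Casey Jordan"]


def substitute_name(real_name, secret="demo-secret"):
    """Map a real name to a fake one, deterministically for a given secret."""
    digest = hashlib.sha256((secret + real_name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


def mask_user(user):
    """Return a copy of the user with the name substituted."""
    masked = dict(user)
    masked["name"] = substitute_name(user["name"])
    return masked


user = {"name": "Grace Hopper", "address": "12 Harbor Road"}
print(mask_user(user))
```

With a pool this small, different real names will often map to the same fake name. That is fine for masking, but it matters if a test relies on names being unique.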

2. Shuffling

Next, data shuffling refers to mixing values within a column, so each value still exists in the table but is attached to a different row. However, we want to ensure we retain logical relationships between data columns in the database. Shuffling is a more advanced technique that masks data while ensuring we have real relationships between the data.

Let’s say we have customer objects in the database linked with purchases. We want to retain this link between the tables for customers and their purchases. So, we mix the first and last names of each customer with those of other customers in the table, and leave the keys that link customers to purchases untouched.

Again, this technique allows you to safely use production data in a test environment.
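The customer example could look roughly like the sketch below. The field names (id, first_name, last_name, customer_id) and the function shuffle_names are assumptions for illustration. Shuffling first and last names independently is one interpretation of “mixing” names.

```python
import random


def shuffle_names(customers, seed=None):
    """Return copies of the customer rows with names shuffled across rows.

    The id field is left untouched, so purchases that reference a
    customer id still point at a valid customer row.
    """
    rng = random.Random(seed)
    first_names = [c["first_name"] for c in customers]
    last_names = [c["last_name"] for c in customers]
    # Shuffle each column independently so name pairs are broken up too.
    rng.shuffle(first_names)
    rng.shuffle(last_names)
    return [
        {**c, "first_name": first, "last_name": last}
        for c, first, last in zip(customers, first_names, last_names)
    ]


customers = [
    {"id": 1, "first_name": "Ana", "last_name": "Silva"},
    {"id": 2, "first_name": "Ben", "last_name": "Okafor"},
    {"id": 3, "first_name": "Chloe", "last_name": "Martin"},
    {"id": 4, "first_name": "Dev", "last_name": "Patel"},
]
purchases = [{"customer_id": 1, "item": "keyboard"}, {"customer_id": 3, "item": "monitor"}]

masked = shuffle_names(customers, seed=42)
print(masked)
```

Note that a random shuffle can leave some rows with their original values, especially in small tables, so you may want to verify the result before relying on it.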

3. Blurring

Another data masking technique is blurring, often used for indirect data identifiers like age.

Thousands of people have the same age. But when enough data points are available, a malicious person might be able to figure out which data points belong to which person. This would mean that the unauthorized person can still identify a user.

Therefore, we can use the blurring technique. Blurring anonymizes data points.

For example, let’s say your application uses the age of users. We can apply a numeric blurring function that adds random noise within a specified range to each age, so the age field is populated with realistic-looking values instead of the real ones.
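A numeric blurring function for ages might be sketched as follows. The function name blur_age, the noise range of three years, and the age limits of 0 and 120 are assumptions chosen for the example.

```python
import random


def blur_age(age, max_offset=3, min_age=0, max_age=120, rng=random):
    """Add uniform integer noise in [-max_offset, max_offset], then clamp.

    Clamping keeps the blurred value inside a realistic age range.
    """
    noisy = age + rng.randint(-max_offset, max_offset)
    return max(min_age, min(max_age, noisy))


ages = [18, 34, 52, 71]
print([blur_age(a) for a in ages])
```

A wider noise range hides the original value better but makes the test data less representative, so the range is a trade-off you choose per field.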

4. Credit Card Masking

Credit card masking is tricky, as valid credit card numbers contain a checksum. The final digit of a credit card number is a check digit, computed from the other digits with the Luhn algorithm. Therefore, we must pay attention when masking credit cards with random numbers, as we don’t want validation to fail for our masked data. Many tools can generate new credit card numbers with a valid checksum.
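To show how a masked number can keep a valid checksum, here is a small sketch that recomputes the check digit with the Luhn algorithm. The function names are hypothetical. Keeping the first six digits is an assumption made so the masked number still resembles one from the same issuer; drop that if your policy requires it.

```python
import random


def luhn_check_digit(partial):
    """Compute the Luhn check digit for a digit string without its check digit."""
    total = 0
    # Walk right to left; the rightmost digit of the partial number is doubled.
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number):
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(number, rng=random):
    """Keep the first six digits, randomize the rest, recompute the check digit."""
    prefix = number[:6]
    middle = "".join(str(rng.randint(0, 9)) for _ in range(len(number) - 7))
    partial = prefix + middle
    return partial + luhn_check_digit(partial)


masked = mask_card_number("4111111111111111")  # a well-known test number
print(masked, is_luhn_valid(masked))
```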

5. Nullification Masking

Finally, nullification masking replaces a column of data with a null value. This technique is only used for hiding highly sensitive data that cannot be mixed or blurred. Applying the nullification technique makes it impossible to discover the original value based on the null value.
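In code, nullification can be as simple as the sketch below. The field names and the function nullify_fields are assumptions for illustration.

```python
def nullify_fields(record, sensitive_fields):
    """Return a copy of the record with the named fields set to None."""
    return {
        key: (None if key in sensitive_fields else value)
        for key, value in record.items()
    }


patient = {"id": 7, "name": "Ana Silva", "diagnosis": "confidential", "ssn": "000-00-0000"}
print(nullify_fields(patient, {"diagnosis", "ssn"}))
```

Keep in mind that code under test must be able to handle null values in those columns, which is not always the case.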

I want to introduce you to dynamic masking as a final masking technique.

Dynamic Data Masking vs. Traditional Data Masking

We use dynamic data masking in real-time environments where data doesn’t leave the production database. This means that we have a higher level of security for our production data.

With dynamic masking, only authorized users can view the original data. However, the application scrambles the data on the spot for unauthorized users. It’s a performant technique for data masking that protects the production database.
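The idea of masking at request time can be sketched in a few lines of Python. The role name, the list of sensitive fields, and both function names are hypothetical. This sketch hides characters rather than scrambling them, which is one of several ways to mask on the fly.

```python
SENSITIVE_FIELDS = ("card_number", "email")  # hypothetical field names
AUTHORIZED_ROLES = frozenset({"admin"})      # hypothetical role name


def hide_value(value):
    """Replace all but the last four characters with asterisks."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    return "*" * (len(text) - 4) + text[-4:]


def read_customer(record, role):
    """Return the record as-is for authorized roles, masked for everyone else.

    The stored record is never modified; masking happens on the copy
    that is handed back to the requester.
    """
    if role in AUTHORIZED_ROLES:
        return dict(record)
    return {
        key: (hide_value(value) if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }


stored = {"name": "Ana Silva", "email": "ana@example.com", "card_number": "4111111111111111"}
print(read_customer(stored, "admin"))
print(read_customer(stored, "tester"))
```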

In contrast, traditional data masking doesn’t use such a dynamic layer that can mask the data. With a traditional approach, you copy the production database, decide upon a data masking technique, and apply it to the copy. After you’re done, you can safely use the masked data in your testing environment.

Get Started With Data Masking

Before starting data masking, assess what data you are returning. Always make sure to return the minimum required data.

Many techniques exist for masking data. If you want to use production data in your test environment, first assess the type of data you are handling. Based on that, you can choose the right data masking technique for your needs.

The easiest way to get started with data masking is the substitution technique. It lets you swap real values for realistic stand-ins, which makes it much harder to identify a record or link it with other records to restore the original. If you need to retain logical relationships in your database, consider the shuffling technique instead.

If you’re working with highly sensitive data, consider using the nullification method. The nullification technique ensures that no sensitive data is exposed.
