
Securing Data by Masking



by Michiel Mulders

With the cost of data breaches increasing every year, there’s a huge need for higher security standards. According to IBM’s 2019 Cost of a Data Breach Report, the average total cost of a data breach has risen to $3.92 million.

It’s no wonder we need more advanced techniques to protect sensitive data. One of those techniques is data masking.

In this post, I’ll talk you through the pros and cons of data masking and show you some of the techniques you can use to mask data.

Let’s start with a detailed introduction to data masking.


What Is Data Masking?

Data masking, also referred to as data obfuscation, is used to hide data. It can be applied to any kind of application, but it’s most often used in enterprise applications that deal with sensitive data.

Generally, masking data entails converting data into random characters or other formats. This makes leaked data useless to unauthorized users. However, a user with the right permissions or authorization is able to see the unmasked data.

Put simply, data masking is a technique for hiding data from unauthorized users and protecting a company’s sensitive data.

Who Uses Data Masking?

In general, every enterprise that handles personal or private information should consider implementing data masking. This is especially true for larger organizations, as they often deal with large amounts of data. It’s not always easy for a larger organization to get a full view of its data. Therefore, these companies should handle their data with extreme care.

In addition to this, some companies have to deal with GDPR regulations. Let’s take a closer look at what that entails.

GDPR and Data Masking

Since the General Data Protection Regulation (GDPR) took effect in May 2018, organizations that process the personal data of people in the EU have to comply with stricter requirements. This means that many businesses are looking for effective ways of protecting their sensitive data.

The most common types of information enterprises want to mask include:

  • Personally identifiable information (PII) such as name, address, and sex
  • Protected health records
  • Transaction history and card or payment information
  • Intellectual property

All these types of data need to be handled with care, and those that contain personal data are subject to GDPR. Companies dealing with these types of data should look for techniques to protect their sensitive data.

Next, let’s dive further into the need for data masking.

Why Do We Need Data Masking?

We already learned about the new GDPR requirements that were introduced in 2018. However, there are many more reasons why you want to mask your data.

  1. Protect business data from third-party vendors. Often, businesses have to share data with third-party companies like suppliers, marketing teams, or consultants. When handing over this data, the business basically loses control over data that should remain confidential. Therefore, data masking can be applied when sharing data with third-party vendors to make sure they receive only masked values, not the confidential originals.
  2. Safeguard against human error. Human error often lies at the source of a data leak. For example, an operator might turn off the firewall for a database. A small mistake like this one can cause a data leak. A business can safeguard itself against such data leaks by masking its data. Then, if an attacker gets unauthorized access to the database, the data is obfuscated and useless to them.
  3. Assess data usage. Not all operations require real data. For example, an application can easily be tested with randomly generated data. I even recommend using randomly generated data during testing, as an application that’s still in the testing phase might leak sensitive data.

If you want to learn more about data masking, read about Enov8’s data compliance case study.

Data Masking Techniques

There are many data masking techniques that you can use. However, first start by assessing what data you’re using and if you’re returning the minimal required data.

Let’s say you need only the address and age of a certain user in your application. As you’re querying for the whole user object, you might just return the whole data object. But from a security standpoint, we want to return only the necessary information to reduce the risk of leaking too much information. Therefore, we can apply a view that returns only the address and age of this user.
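As a minimal sketch of that idea, here’s how such a view could look with Python’s built-in sqlite3 module. The `users` table, its columns, and the view name `user_address_age` are assumptions I made up for illustration. Your schema will look different.

```python
import sqlite3

# In-memory database with a hypothetical users table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, address TEXT, age INTEGER)"
)
conn.execute(
    "INSERT INTO users (name, email, address, age) VALUES (?, ?, ?, ?)",
    ("Alice Example", "alice@example.com", "1 Main Street", 34),
)

# The view exposes only the two columns the application actually needs.
conn.execute("CREATE VIEW user_address_age AS SELECT address, age FROM users")

# Querying the view never returns the name or email columns.
cursor = conn.execute("SELECT * FROM user_address_age")
view_columns = [column[0] for column in cursor.description]
rows = cursor.fetchall()
```

Here, `view_columns` contains only `address` and `age`, so code that reads from the view can’t accidentally leak the name or email.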

Now that we know to limit the data we return to only the data we need, let’s explore five different data masking techniques.

1. Substitution

The substitution technique refers to substituting data with similar values. Substitution is an effective technique to replace production data with realistic data.

For example, let’s say we’re returning a user object with a name and address. In order to mask the data, we might want to substitute the user’s real name with a fake (but realistic-looking) name. This ensures the combination of name and address cannot identify a person.
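One possible way to sketch substitution in Python is shown below. The lists of fake names, the `secret` parameter, and the function names are all hypothetical. I use a keyed hash so the same real name always maps to the same fake name, which is a design choice for this sketch and not a requirement of the technique.

```python
import hashlib
import hmac

# Small illustrative pools of replacement names (assumed, not from a real dataset).
FAKE_FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan", "Casey"]
FAKE_LAST_NAMES = ["Miller", "Jansen", "Dubois", "Rossi", "Novak", "Silva"]


def substitute_name(real_name: str, secret: str = "demo-secret") -> str:
    """Map a real name to a realistic-looking fake name, deterministically."""
    digest = hmac.new(
        secret.encode("utf-8"), real_name.encode("utf-8"), hashlib.sha256
    ).digest()
    # Use two bytes of the keyed hash to pick a first and a last name.
    first = FAKE_FIRST_NAMES[digest[0] % len(FAKE_FIRST_NAMES)]
    last = FAKE_LAST_NAMES[digest[1] % len(FAKE_LAST_NAMES)]
    return f"{first} {last}"


def mask_user(user: dict) -> dict:
    """Return a copy of the user with the name substituted; other fields are kept."""
    masked = dict(user)
    masked["name"] = substitute_name(user["name"])
    return masked
```

With such small name pools, different people will often end up with the same fake name. A real setup would probably draw from a much larger list.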

2. Shuffling

Next, data shuffling refers to mixing data. However, we want to make sure we retain logical relationships between data columns in the database. Shuffling is a more advanced technique that masks data while making sure we have realistic relationships between the data.

Let’s say we have customer objects in the database that are linked with purchases. We want to retain this link between the tables for customers and their purchases. Therefore, we use shuffling to mix the values for the first and last name of the customer with another customer in the table.

Again, this technique allows you to safely use production data in a test environment.
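Here’s a rough sketch of shuffling in Python, assuming customers are plain dictionaries with hypothetical `id`, `first_name`, and `last_name` keys. The names move between customers as a pair, while each `id` stays where it is, so the link to the purchases table is preserved.

```python
import random


def shuffle_columns(rows, columns, seed=None):
    """Shuffle the given columns across rows, leaving all other columns untouched."""
    rng = random.Random(seed)
    # Collect the values of the chosen columns as tuples so they move together.
    values = [tuple(row[column] for column in columns) for row in rows]
    rng.shuffle(values)

    shuffled = []
    for row, new_values in zip(rows, values):
        new_row = dict(row)  # copy, so the input rows are not modified
        new_row.update(zip(columns, new_values))
        shuffled.append(new_row)
    return shuffled


customers = [
    {"id": 1, "first_name": "Ann", "last_name": "Lee"},
    {"id": 2, "first_name": "Bob", "last_name": "Stone"},
    {"id": 3, "first_name": "Cara", "last_name": "Diaz"},
]
masked_customers = shuffle_columns(customers, ["first_name", "last_name"], seed=42)
```

Keep in mind that a plain random shuffle can leave some rows in their original position, especially in small tables. If that matters for your data, you’d need an extra check.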

3. Blurring

Another data masking technique is blurring, which is often used for indirect data identifiers like age.

Thousands of people have the same age. But when enough data points are available, a malicious person might be able to figure out which data points belong to which person. This would mean that the unauthorized person can still identify a user.

Therefore, we can use the blurring technique. Blurring basically anonymizes data points.

For example, let’s say your application uses the age of users. We can apply a numeric blurring function that adds random noise within a specified range to each value, so the age field is populated with realistic-looking but inexact ages.
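A numeric blurring function could be sketched like this. The offset of three years and the bounds of 0 and 120 are values I picked for illustration. You’d tune them to your own data.

```python
import random


def blur_age(age, max_offset=3, min_age=0, max_age=120, rng=None):
    """Add random noise of at most max_offset years and keep the result realistic."""
    if rng is None:
        rng = random.Random()
    noisy_age = age + rng.randint(-max_offset, max_offset)
    # Clamp so the blurred value never falls outside a plausible age range.
    return max(min_age, min(max_age, noisy_age))
```

A larger `max_offset` gives more privacy but makes the data less useful for analysis, so there’s a trade-off to consider.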

4. Credit Card Masking

Credit card masking is trickier, as valid credit card numbers can be verified by using a checksum. The final digit of a card number is a check digit, computed from the other digits with the Luhn algorithm. Therefore, we have to pay attention when masking credit cards with random numbers, as we don’t want the credit card validation to fail for our masked data. Many tools exist that can generate new credit card numbers with a valid checksum.
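To give you an idea of what such a tool does, here’s a small sketch based on the Luhn algorithm, which is the checksum used for card numbers. The function names are my own, and the sketch replaces every digit with a random one. Real tools often keep parts of the number, such as the issuer prefix.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a number that lacks its final digit."""
    total = 0
    # Walk from right to left; the rightmost digit of the partial number is doubled.
    for index, char in enumerate(reversed(partial)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Check whether the final digit matches the Luhn check digit of the rest."""
    return number.isdigit() and luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(number: str, rng=None) -> str:
    """Replace a card number with random digits of the same length and a valid checksum."""
    if rng is None:
        rng = random.Random()
    body = "".join(str(rng.randint(0, 9)) for _ in range(len(number) - 1))
    return body + luhn_check_digit(body)
```

For the well-known test number 4111111111111111, `luhn_check_digit("411111111111111")` returns `"1"`, which matches the final digit.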

5. Nullification Masking

Finally, nullification masking replaces a column of data with a null value. This technique is only used for hiding highly sensitive data that cannot be mixed or blurred. By applying the nullification technique, you make sure it’s impossible to discover the original value based on the null value.
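In Python, nullification can be as simple as the sketch below. The record layout and the field names are hypothetical.

```python
def nullify_fields(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with every sensitive field replaced by None."""
    return {
        key: (None if key in sensitive_fields else value)
        for key, value in record.items()
    }


patient = {"id": 7, "name": "Alice Example", "diagnosis": "confidential text"}
masked_patient = nullify_fields(patient, {"diagnosis"})
```

The field itself stays in the record, so code that expects a `diagnosis` key keeps working as long as it can handle a null value.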

As a final masking technique, I want to introduce you to dynamic masking.

Dynamic Data Masking vs. Traditional Data Masking

Dynamic data masking is used in real-time environments where data doesn’t leave the production database. This means that we have a higher level of security for our production data.

With dynamic masking, only authorized users can view the authentic data. However, for unauthorized users, the data is scrambled on the spot, returning inauthentic data. It’s a very performant technique for data masking that protects the production database.

In contrast, traditional data masking doesn’t use such a dynamic layer that can mask the data. With a traditional approach, you make a copy of the production database and decide upon a data masking technique to be applied to that copy. When the technique has been applied, we can safely use the masked copy for testing in our testing environment.
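To make the dynamic approach more concrete, here’s a simplified sketch of a read layer that masks data on the spot based on the caller’s role. The role names, the record fields, and the masking rules are assumptions for illustration. Database products that offer dynamic masking implement this inside the database itself.

```python
# Roles that are allowed to see the authentic data (hypothetical role names).
AUTHORIZED_ROLES = {"admin", "compliance"}


def mask_card(card_number: str) -> str:
    """Hide all but the last four digits of a card number."""
    return "*" * (len(card_number) - 4) + card_number[-4:]


def read_user(record: dict, role: str) -> dict:
    """Return the authentic record for authorized roles and a masked copy otherwise."""
    if role in AUTHORIZED_ROLES:
        return dict(record)
    masked = dict(record)
    masked["name"] = "*" * len(record["name"])
    masked["card_number"] = mask_card(record["card_number"])
    return masked


stored = {"id": 1, "name": "Alice Example", "card_number": "4111111111111111"}
admin_view = read_user(stored, "admin")
support_view = read_user(stored, "support")
```

The stored record is never changed. Only the copy that’s returned to an unauthorized caller is masked, which is the key difference from the traditional approach.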

Get Started With Data Masking

Before you get started with data masking, assess what data you are returning. Always make sure to return the minimum required data.

Many techniques exist for masking data. If you want to use production data in your test environment, first assess the type of data you are handling. Based on that, you can choose the right data masking technique for your needs.

The easiest way to get started with data masking is the substitution technique. It allows you to simply replace real values with realistic fake ones, making it much harder to link a record back to the original person. But personally, I like the shuffling technique, as it allows you to retain logical relationships in your database. However, shuffling is a more advanced method for masking data.

I’ll leave you with this: If you’re working with highly sensitive data, I always recommend using the nullification method. The nullification technique ensures that no sensitive data can be exposed.

Michiel Mulders

Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!
