Securing Data by Masking
by Michiel Mulders
With the cost of data breaches increasing every year, there’s a huge need for higher security standards. According to IBM’s 2019 security report, the average total cost of a data breach has risen to $3.92 million per breach.
It’s no wonder we need more advanced techniques to protect sensitive data. One of those techniques is data masking.
In this post, I’ll talk you through the pros and cons of data masking and show you some of the techniques you can use to mask data.
Let’s start with a detailed introduction to data masking.
What Is Data Masking?
Data masking, also referred to as data obfuscation, is used to hide data. It can be applied to any kind of application, but it’s most often found in enterprise applications that deal with sensitive data.
Generally, masking data entails converting data into random characters or other formats. This makes leaked data useless to unauthorized users. However, a user with the right permissions or authorization is able to see the unmasked data.
Put simply, data masking is a technique for hiding data from unauthorized users and protecting a company’s sensitive data.
Who Uses Data Masking?
In general, every enterprise that handles personal or private information should consider implementing data masking. This is especially true for larger organizations, as they often deal with large amounts of data. It’s not always easy for a larger organization to get a full view of its data, so sensitive records can end up in places nobody is watching. Therefore, these companies should handle their data with extreme care.
In addition to this, some companies have to deal with GDPR regulations. Let’s take a closer look at what that entails.
GDPR and Data Masking
Since the General Data Protection Regulation (GDPR) took effect in 2018, companies that handle the personal data of people in the EU have to comply with stricter requirements. This means that many businesses are looking for effective ways of protecting their sensitive data.
The most common types of information enterprises want to mask include:
- Personally identifiable information (PII) like name, address, sex, etc.
- Protected health records
- Transaction history and card or payment information
- Intellectual property
All these types of data need to be handled with care, and those that contain personal data are subject to GDPR. Companies dealing with these types of data should look for techniques to protect them.
Next, let’s dive further into the need for data masking.
Why Do We Need Data Masking?
We already learned about the new GDPR requirements that were introduced in 2018. However, there are many more reasons why you want to mask your data.
- Protect business data from third-party vendors. Often, businesses have to share data with third-party companies like suppliers, marketing teams, or consultants. When handing over this data, the business basically loses control over data that should remain confidential. Therefore, data masking can be applied when sharing data with third-party vendors to make sure the vendors see only the data they need, while the confidential values stay hidden.
- Safeguard against human error. Human error often lies at the source of a data leak. For example, an operator might turn off the firewall for a database. A small mistake like this one can cause a data leak. A business can safeguard itself against such data leaks by masking its data. Then, if an attacker gets unauthorized access to the database, the data is obfuscated and useless to them.
- Assess data usage. Not all operations require real data. For example, testing of an application can be easily executed with randomly generated data. I even recommend using randomly generated data during testing, as an application that’s still in the testing phase might leak sensitive data.
If you want to learn more about data masking, read about Enov8’s data compliance case study.
Data Masking Techniques
There are many data masking techniques that you can use. However, first start by assessing what data you’re using and whether you’re returning only the minimum required data.
Let’s say you need only the address and age of a certain user in your application. As you’re querying for the whole user object, you might just return the whole data object. But from a security standpoint, we want to return only the necessary information to reduce the risk of leaking too much information. Therefore, we can apply a view that returns only the address and age of this user.
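As a minimal sketch of that idea, the snippet below restricts a record to the fields a caller needs before returning it. The `user` record, the `ALLOWED_FIELDS` tuple, and the `minimal_view` helper are all hypothetical names made up for illustration. In a real system you would more likely define a database view or select only the needed columns in the query itself.

```python
from typing import Any

# Hypothetical full user record, as it might come back from a query.
user = {
    "id": 42,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "address": "1 Main Street",
    "age": 34,
}

# Assumption: this part of the application only needs address and age.
ALLOWED_FIELDS = ("address", "age")


def minimal_view(record: dict[str, Any],
                 allowed: tuple[str, ...] = ALLOWED_FIELDS) -> dict[str, Any]:
    """Return a copy of the record restricted to the allowed fields."""
    return {key: record[key] for key in allowed if key in record}


print(minimal_view(user))  # {'address': '1 Main Street', 'age': 34}
```

Because the name and email never leave this function, they cannot leak through whatever consumes the result.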
Now that we know to limit the data we return to only the data we need, let’s explore five different data masking techniques.
1. Substitution
The substitution technique refers to substituting data with similar values. Substitution is an effective technique to replace production data with realistic data.
For example, let’s say we’re returning a user object with a name and address. In order to mask the data, we might want to substitute the user’s real name with a fake (but realistic-looking) name. This ensures the combination of name and address cannot identify a person.
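Here is one possible sketch of substitution. It assumes a small, hypothetical pool of fake names (`FAKE_NAMES`) and picks a replacement by hashing the real name with a salt, so the same real name always maps to the same fake name. That deterministic choice is my assumption, not a requirement of the technique, and a real data set would need a much larger pool and a secret salt.

```python
import hashlib

# Hypothetical pool of realistic-looking replacement names.
FAKE_NAMES = ["Alex Morgan", "Sam Taylor", "Robin Clarke", "Jamie Lee", "Chris Novak"]


def substitute_name(real_name: str, salt: str = "demo-salt") -> str:
    """Pick a fake name deterministically from the pool.

    Hashing the salted real name keeps the mapping stable across runs,
    which helps when the same person appears in several tables.
    """
    digest = hashlib.sha256((salt + real_name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


def mask_user(user: dict) -> dict:
    """Return a copy of the user with the name substituted."""
    masked = dict(user)
    masked["name"] = substitute_name(user["name"])
    return masked


original = {"name": "Michiel Janssens", "address": "1 Main Street"}
print(mask_user(original))
```

The address is left untouched here, which matches the example above: the name no longer belongs to the address, so the pair no longer points at a real person.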
2. Shuffling
Next, data shuffling refers to mixing data. However, we want to make sure we retain logical relationships between data columns in the database. Shuffling is a more advanced technique that masks data while making sure we have realistic relationships between the data.
Let’s say we have customer objects in the database that are linked with purchases. We want to retain this link between the tables for customers and their purchases. Therefore, we use shuffling to mix the values for the first and last name of the customer with another customer in the table.
Again, this technique allows you to safely use production data in a test environment.
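The sketch below shows one way shuffling could look on an in-memory copy of a customers table. The `customers` rows and the `shuffle_columns` helper are hypothetical. I assume each name column is shuffled independently, while `customer_id` stays where it is so that purchases referencing it still resolve.

```python
import random

# Hypothetical copy of a customers table; purchases reference customer_id.
customers = [
    {"customer_id": 1, "first_name": "Ann", "last_name": "Peters"},
    {"customer_id": 2, "first_name": "Bob", "last_name": "Janssens"},
    {"customer_id": 3, "first_name": "Cleo", "last_name": "Maes"},
    {"customer_id": 4, "first_name": "Dirk", "last_name": "Claes"},
]


def shuffle_columns(rows, columns, seed=None):
    """Shuffle the values of each given column across all rows.

    Columns that are not listed (such as the key) are left untouched,
    so links to other tables keep working.
    """
    rng = random.Random(seed)
    masked = [dict(row) for row in rows]
    for column in columns:
        values = [row[column] for row in rows]
        rng.shuffle(values)
        for row, value in zip(masked, values):
            row[column] = value
    return masked


for row in shuffle_columns(customers, ["first_name", "last_name"], seed=7):
    print(row)
```

Keep in mind that a random shuffle can leave some values in their original row, especially in small tables, so it is worth checking the result before relying on it.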
3. Blurring
Another data masking technique is blurring, which is often used for indirect data identifiers like age.
Thousands of people have the same age. But when enough data points are available, a malicious person might be able to figure out which data points belong to which person. This would mean that the unauthorized person can still identify a user.
Therefore, we can use the blurring technique. Blurring basically anonymizes data points.
For example, let’s say your application uses the age of users. We can apply a numeric blurring function that adds random noise within a specified range to each age. The age field is then populated with values that still look realistic but no longer match the real ones.
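A numeric blurring function can be sketched in a few lines. The `blur_age` name, the default offset of three years, and the 0 to 120 bounds are assumptions chosen for illustration.

```python
import random


def blur_age(age, max_offset=3, min_age=0, max_age=120, rng=None):
    """Return the age shifted by a random offset in [-max_offset, max_offset].

    The result is clamped so it stays within a realistic range.
    """
    rng = rng if rng is not None else random.Random()
    noisy = age + rng.randint(-max_offset, max_offset)
    return max(min_age, min(max_age, noisy))


ages = [23, 35, 35, 61]
print([blur_age(age) for age in ages])
```

A larger `max_offset` hides individuals better but distorts statistics more, so the range is a trade-off you have to pick per use case.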
4. Credit Card Masking
Credit card masking is trickier, as valid credit card numbers can be verified by using a checksum. The final digit of a credit card number is the check digit, which is calculated from all the other digits. Therefore, we have to pay attention when masking credit cards with random numbers, as we don’t want the credit card validation to fail for our masked data. Many tools exist that can generate new credit card numbers with a valid checksum.
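Card numbers are commonly validated with the Luhn algorithm, so a masking function has to recompute the check digit after randomizing the rest. The sketch below does that. The helper names are hypothetical, and keeping the first six digits intact is an assumption I made so the masked number still looks like it comes from the same issuer.

```python
import random


def luhn_check_digit(partial: str) -> int:
    """Compute the Luhn check digit for a number that lacks its last digit."""
    total = 0
    # Walk right to left; every second digit, starting with the rightmost
    # digit of the partial number, is doubled.
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_luhn_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def mask_card_number(number: str, keep_prefix: int = 6, rng=None) -> str:
    """Replace the digits after the prefix with random ones and fix the checksum."""
    rng = rng if rng is not None else random.Random()
    prefix = number[:keep_prefix]
    middle_length = len(number) - keep_prefix - 1
    middle = "".join(str(rng.randint(0, 9)) for _ in range(middle_length))
    partial = prefix + middle
    return partial + str(luhn_check_digit(partial))


masked = mask_card_number("4111111111111111")  # a well-known test number
print(masked, is_luhn_valid(masked))
```

The masked number passes the same checksum validation as a real one, so forms and payment test stubs that only check the format keep working.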
5. Nullification Masking
Finally, nullification masking replaces a column of data with a null value. This technique is only used for hiding highly sensitive data that cannot be mixed or blurred. By applying the nullification technique, you make sure it’s impossible to discover the original value based on the null value.
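Nullification is the simplest technique to sketch. The example below uses a hypothetical `nullify` helper and made-up patient rows, and replaces every value in the chosen columns with `None`, which is Python’s equivalent of a database null.

```python
def nullify(rows, columns):
    """Return copies of the rows with the given columns set to None."""
    return [
        {key: (None if key in columns else value) for key, value in row.items()}
        for row in rows
    ]


patients = [
    {"id": 1, "name": "Ann Peters", "diagnosis": "asthma"},
    {"id": 2, "name": "Bob Maes", "diagnosis": "diabetes"},
]

print(nullify(patients, {"diagnosis"}))
```

The downside is that the column becomes useless for testing anything that depends on its content, so this fits best where the value itself is never needed.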
As a final masking technique, I want to introduce you to dynamic masking.
Dynamic Data Masking vs. Traditional Data Masking
Dynamic data masking is used in real-time environments where data doesn’t leave the production database. This means that we have a higher level of security for our production data.
With dynamic masking, only authorized users can view the authentic data. However, for unauthorized users, the data is scrambled on the spot, returning inauthentic data. It’s a very performant technique for data masking that protects the production database.
In contrast, traditional data masking doesn’t use such a dynamic layer that can mask the data. With a traditional approach, you make a copy of the production database and decide upon a data masking technique to apply to that copy. When the technique has been applied, we can safely use the masked copy for testing in our testing environment.
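To make the dynamic approach more concrete, here is a rough sketch of a read layer that masks values on the fly based on the caller’s role. Everything in it is hypothetical: the role names, the `read_customer` function, and the masking rules. Dynamic masking is usually handled by the database or a proxy in front of it rather than by application code like this, so treat it only as an illustration of the principle that the stored data never changes.

```python
# Assumption: these roles are allowed to see the authentic data.
AUTHORIZED_ROLES = {"admin", "compliance"}


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def mask_card(card_number: str) -> str:
    """Show only the last four digits."""
    return "*" * (len(card_number) - 4) + card_number[-4:]


def read_customer(record: dict, role: str) -> dict:
    """Return the record as the given role is allowed to see it."""
    if role in AUTHORIZED_ROLES:
        return dict(record)
    masked = dict(record)
    masked["email"] = mask_email(record["email"])
    masked["card_number"] = mask_card(record["card_number"])
    return masked


stored = {"name": "Jane Doe", "email": "jane@example.com",
          "card_number": "4111111111111111"}

print(read_customer(stored, "admin"))
print(read_customer(stored, "support"))
```

The second call returns `j***@example.com` and `************1111`, while the stored record itself stays exactly as it was.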
Get Started With Data Masking
Before you get started with data masking, assess what data you are returning. Always make sure to return the minimum required data.
Many techniques exist for masking data. If you want to use production data in your test environment, first assess the type of data you are handling. Based on that, you can choose the right data masking technique for your needs.
The easiest way to get started with data masking is the substitution technique. It allows you to simply replace real values with realistic fake ones, making it much harder to identify a person or to link the data with other records to restore the original record. But personally, I like the shuffling technique, as it allows you to retain logical relationships in your database. However, shuffling is a more advanced method for masking data.
I’ll leave you with this: If you’re working with highly sensitive data, I always recommend using the nullification method. The nullification technique ensures that no sensitive data can be exposed.
Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!