Data masking
Data masking is the process of obscuring (masking) specific data elements within data stores. It ensures that sensitive data is replaced with realistic but not real data. The goal is to keep sensitive customer information unavailable outside of the authorized environment. Data masking is typically performed while provisioning non-production environments, so that the copies created to support test and development processes do not expose sensitive information, thereby avoiding the risk of data leakage. Masking algorithms are designed to be repeatable so that referential integrity is maintained.[1]
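Repeatability is usually achieved by making the masking function deterministic, so that the same source value always maps to the same masked value wherever it appears. The following is a minimal Python sketch of that idea; the function name, key, and substitute list are illustrative assumptions rather than part of any particular product.

```python
import hashlib

# Illustrative substitute pool; a real deployment would use a larger prepared dataset.
SUBSTITUTE_NAMES = ["Alice Grant", "Bob Osei", "Carla Ruiz", "Dana Novak"]

def mask_name(original: str, secret_key: str = "masking-key") -> str:
    """Repeatably map an original value to a substitute value.

    The same input always yields the same output, so a customer name that
    appears in several tables is masked identically everywhere, which keeps
    joins and other referential links intact.
    """
    digest = hashlib.sha256((secret_key + original).encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(SUBSTITUTE_NAMES)
    return SUBSTITUTE_NAMES[index]

# The same source value masks to the same substitute in every table copy.
assert mask_name("John Smith") == mask_name("John Smith")
```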
Common business applications undergo constant patch and upgrade cycles, which typically require that six to eight copies of the application and its data be made for testing. While organizations typically have strict controls on production systems, data security in non-production instances is often left to trust in the employee, with potentially disastrous results.
Creating test and development copies in an automated process reduces the exposure of sensitive data. Because the database layout often changes, it is useful to maintain the list of sensitive columns separately, so that masking can be applied without rewriting application code. Data masking is an effective strategy for reducing the risk of data exposure from inside and outside of an organization and should be considered a best practice for securing non-production databases. It can be done in a copy-then-mask approach or a mask-while-copy approach (the latter is branded as Dynamic Data Masking in some products).
Requirements
A key requirement for any data masking and obfuscation practice is that the data must remain meaningful at several levels. Firstly, it must remain meaningful to the application logic. For example, if elements of addresses are to be obfuscated and cities and suburbs are replaced with substitute cities or suburbs, then any feature within the application that validates postcodes or performs postcode lookups must still operate without error and behave as expected. The same is true for credit card number validation checks and Social Security Number validations. Secondly, the data must be altered sufficiently that it is not obvious that the masked data derives from production data. For example, it may be common knowledge in an organisation that there are 10 senior managers all earning in excess of $300K. If a test environment of the organisation's HR system also contains 10 identities in the same earning bracket, then other information could be pieced together to reverse-engineer a real-life identity. Theoretically, if the data is obviously masked or obfuscated, it would be reasonable for someone with data-breach intentions to assume that they could reverse-engineer identity data given some degree of knowledge of the identities in the production data set. It is for this reason that data obfuscation or masking of a data set is conducted in such a manner as to ensure that identity and sensitive data records are protected as a whole, and not just the individual data elements in discrete fields and tables.
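As an illustration of the first requirement, a routine that substitutes credit card numbers can generate replacement values that still satisfy the Luhn check commonly used by card-number validators. The following Python sketch shows one way to do this; the function names and the choice to randomize every digit are illustrative assumptions.

```python
import random

def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for the given partial card number."""
    digits = [int(d) for d in partial]
    total = 0
    # Double every second digit from the right, subtracting 9 when the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def mask_card_number(original: str) -> str:
    """Replace a card number with a random one of the same length that passes the Luhn check.

    A real masking tool might preserve the issuer prefix (BIN); here every
    digit is randomized for simplicity.
    """
    body = "".join(str(random.randint(0, 9)) for _ in range(len(original) - 1))
    return body + luhn_check_digit(body)
```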
Data Masking Techniques
Substitution
The substitution technique replaces the existing data with random values from a prepared dataset.
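A minimal Python sketch of substitution, assuming a small prepared dataset of surnames (the pool and function name are illustrative):

```python
import random

# Prepared substitution dataset (illustrative values).
SURNAME_POOL = ["Smith", "Garcia", "Chen", "Okafor", "Kowalski", "Ivanov"]

def substitute_surname(_original: str) -> str:
    """Replace the real surname with a random value drawn from the prepared dataset."""
    return random.choice(SURNAME_POOL)

masked_rows = [
    {"id": 1, "surname": substitute_surname("Johnson")},
    {"id": 2, "surname": substitute_surname("Lindqvist")},
]
```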
Shuffling
The Shuffling technique uses the existing data as its own substitution dataset and moves the values between rows in such a way that no values are present in their original rows.
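A minimal Python sketch of shuffling, which re-samples a random permutation until no row keeps its original value (it assumes the column has at least two rows; a production tool would typically construct such a permutation directly):

```python
import random

def shuffle_column(values: list) -> list:
    """Shuffle a column so that no row keeps its original value (a derangement).

    The permutation is re-sampled until no index maps to itself, which is
    adequate for illustration.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two rows are required to shuffle")
    while True:
        order = list(range(n))
        random.shuffle(order)
        if all(order[i] != i for i in range(n)):
            return [values[i] for i in order]

# Example: salary values move between rows, so no row keeps its real salary.
salaries = [52000, 61000, 73000, 88000, 95000]
masked_salaries = shuffle_column(salaries)
```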
Number and Date Variance
The Number and Date Variance technique varies the existing values in a specified range in order to obfuscate them. For example, birth date values could be changed within a range of +/- 60 days.
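A minimal Python sketch of number and date variance, assuming a +/- 60-day window for dates and a +/- 10% window for numeric values (both ranges are illustrative):

```python
import random
from datetime import date, timedelta

def vary_date(original: date, max_days: int = 60) -> date:
    """Shift a date by a random offset within +/- max_days."""
    return original + timedelta(days=random.randint(-max_days, max_days))

def vary_number(original: float, max_fraction: float = 0.10) -> float:
    """Shift a numeric value by a random percentage within +/- max_fraction."""
    return original * (1 + random.uniform(-max_fraction, max_fraction))

masked_birth_date = vary_date(date(1985, 3, 14))   # within +/- 60 days of the original
masked_salary = vary_number(64000)                 # within +/- 10% of the original
```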
Encryption
The Encryption technique algorithmically scrambles the data. This usually does not leave the data looking realistic and can sometimes make the data larger.
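A minimal Python sketch of masking by encryption, using the Fernet cipher from the third-party cryptography package purely as one example of a symmetric cipher; note how the output no longer resembles the original and is longer than it:

```python
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher = Fernet(key)

original = b"4111 1111 1111 1111"
masked = cipher.encrypt(original)

# The ciphertext is unreadable and noticeably longer than the input,
# so it will usually fail application-level format validation.
print(masked)
print(len(original), len(masked))
```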
Nulling Out Or Deletion
The Nulling Out technique simply removes the sensitive data by deleting it.
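A minimal Python sketch of nulling out, assuming the sensitive field names are known in advance (the field names used here are illustrative):

```python
def null_out(record: dict, sensitive_fields: tuple = ("ssn", "credit_card")) -> dict:
    """Return a copy of the record with sensitive fields set to None (NULL)."""
    return {k: (None if k in sensitive_fields else v) for k, v in record.items()}

masked = null_out({"id": 7, "name": "J. Doe", "ssn": "078-05-1120"})
# {'id': 7, 'name': 'J. Doe', 'ssn': None}
```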
Table-To-Table Synchronization
If two tables contain columns with the same denormalized data values, and those columns are masked in one table, then the second table needs to be updated with the same changes. This technique is called table-to-table synchronization.
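A minimal Python sketch of table-to-table synchronization, in which a deterministic masking function is applied to the first table and the resulting mapping is reused for the denormalized copy (the table and column names are illustrative):

```python
import hashlib

SUBSTITUTE_EMAILS = ["user1@example.org", "user2@example.org", "user3@example.org"]

def mask_email(original: str) -> str:
    """Deterministically map an e-mail address to a substitute value."""
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
    return SUBSTITUTE_EMAILS[int(digest, 16) % len(SUBSTITUTE_EMAILS)]

customers = [{"customer_id": 1, "email": "jane@corp.example"}]
orders = [{"order_id": 10, "customer_id": 1, "email": "jane@corp.example"}]  # denormalized copy

# Mask the primary table and record the mapping so the second table receives identical values.
mapping = {}
for row in customers:
    mapping[row["email"]] = mask_email(row["email"])
    row["email"] = mapping[row["email"]]

for row in orders:
    row["email"] = mapping.get(row["email"], mask_email(row["email"]))
```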
References
- ^ "Information Management Specialists". GBT. Retrieved 27 June 2012.