Data masking
Data masking, or data obfuscation,[1] is the process of de-identifying (masking) specific data elements within data stores.
The main reason for applying masking to a data field is to protect data that is classified as personally identifiable, personally sensitive or commercially sensitive; however, the data must remain usable for the purposes of undertaking valid test cycles, and it must also look real and appear consistent. Masking is most commonly applied to data that is represented outside of a corporate production system, in other words where data is needed for application development, building program extensions and conducting various test cycles. It is common practice in enterprise computing to take data from production systems to fill the data component required for these non-production environments. However, the practice is not always restricted to non-production environments. In some organisations, data that appears on terminal screens to call centre operators may have masking dynamically applied based on user security permissions (e.g. preventing call centre operators from viewing credit card numbers in billing systems).
The primary concern from a corporate governance perspective is that personnel conducting work in these non-production environments are not always security cleared to operate with the information contained in the production data. This practice represents a security hole where data can be copied by unauthorised personnel, and the security measures associated with standard production-level controls can be easily bypassed, providing an access point for a data security breach.
Requirements
A key requirement for any data masking and obfuscation practice is that the data must remain meaningful at several levels. First, it must remain meaningful for the application logic. For example, if elements of addresses are to be obfuscated and cities and suburbs are replaced with substitutes, then any feature within the application that validates postcodes or performs postcode lookups must still operate without error and as expected. The same is true for credit card algorithm validation checks and Social Security number validations. Second, the data must be sufficiently altered so that it is not obvious that the masked data has come from a source of production data. For example, it may be common knowledge in an organisation that there are ten senior managers all earning in excess of $300K. If a test environment of the organisation's HR system also contains ten identities in the same earning bracket, other information could be pieced together to reverse-engineer a real-life identity. Theoretically, if the data is obviously masked or obfuscated, it would be reasonable for someone with data breach intentions to assume that they could reverse-engineer identity data given some knowledge of the identities in the production data set. For this reason, data obfuscation or masking of a data set is conducted in such a manner as to ensure that identity and sensitive data records are protected as a whole, and not just the individual data elements in discrete fields and tables.
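A minimal Python sketch of the kind of post-masking consistency check described above is shown below; the suburb-to-postcode lookup table and the record fields are illustrative assumptions, not part of any particular product.

    # Minimal sketch: verify that masked address records still satisfy a
    # suburb-to-postcode lookup, so application-level postcode validation
    # keeps working. Lookup table and field names are illustrative assumptions.

    SUBURB_POSTCODES = {
        "Richmond": "3121",
        "Parramatta": "2150",
        "Fortitude Valley": "4006",
    }

    def is_address_consistent(record):
        """Return True if the (suburb, postcode) pair matches the lookup."""
        return SUBURB_POSTCODES.get(record["suburb"]) == record["postcode"]

    masked_record = {"suburb": "Richmond", "postcode": "3121"}
    assert is_address_consistent(masked_record)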
Data Masking Techniques
Substitution
Substitution is one of the most effective methods of applying data masking while preserving the authentic look and feel of the data records.
It allows the masking to be performed in such a manner that another authentic-looking value is substituted for the existing value. For several data field types this approach provides optimal benefit in disguising whether or not the overall data sub-set is a masked data set. For example, if dealing with source data which contains customer records, a real-life surname or first name can be randomly substituted from a supplied or customised look-up file. If the first pass of the substitution applies a male first name to all first names, then the second pass would apply a female first name to all first names where gender equals "F". Using this approach the gender mix within the data structure is maintained, anonymity is applied to the data records, and the database still looks realistic and could not easily be identified as a database consisting of masked data.
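A minimal Python sketch of this idea is shown below. It uses a single pass that selects from gender-specific lists, which is equivalent in effect to the two-pass approach described above; the name lists and record layout are illustrative assumptions.

    import random

    # Minimal sketch of substitution masking: first names are replaced from
    # gender-specific look-up lists so the masked data keeps its gender mix.
    # The name lists and record fields are illustrative assumptions.

    MALE_NAMES = ["James", "Liam", "Noah", "Oliver"]
    FEMALE_NAMES = ["Emma", "Olivia", "Ava", "Sophia"]

    def substitute_first_name(record, rng=random):
        """Replace first_name with a random substitute matching the gender flag."""
        pool = FEMALE_NAMES if record["gender"] == "F" else MALE_NAMES
        masked = dict(record)
        masked["first_name"] = rng.choice(pool)
        return masked

    customers = [
        {"first_name": "Brendan", "gender": "M"},
        {"first_name": "Alice", "gender": "F"},
    ]
    masked_customers = [substitute_first_name(c) for c in customers]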
This substitution method needs to be applied to many of the fields found in database structures across the world, such as telephone numbers, zip codes and postcodes, as well as credit card numbers, which must still conform to the checksum test of the Luhn algorithm, and other identifying numbers such as Social Security numbers and Medicare numbers that have their own format rules.
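The sketch below illustrates how a substitute card number can be generated so that it still passes the Luhn checksum; the prefix and length used are illustrative assumptions only, not a real issuer range.

    import random

    # Minimal sketch: generate a substitute card number that still passes the
    # Luhn checksum, so application-level card validation does not break.
    # The prefix is an illustrative assumption, not a real issuer identifier.

    def luhn_check_digit(partial_number: str) -> str:
        """Compute the Luhn check digit for the given digit string."""
        total = 0
        # Double every second digit, counting from the right of the partial number.
        for i, ch in enumerate(reversed(partial_number)):
            d = int(ch)
            if i % 2 == 0:
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return str((10 - total % 10) % 10)

    def masked_card_number(prefix="411111", length=16, rng=random):
        """Build a Luhn-valid substitute number with the given prefix and length."""
        body = prefix + "".join(str(rng.randint(0, 9))
                                for _ in range(length - len(prefix) - 1))
        return body + luhn_check_digit(body)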
In most cases the substitution files will need to be fairly extensive, so having large substitution datasets, as well as the ability to apply customised data substitution sets, should be a key element of the evaluation criteria for any data masking solution.
Shuffling
The shuffling method is a very common form of data obfuscation. It is similar to the substitution method, but it derives the substitution set from the same column of data that is being masked: in very simple terms, the data is randomly shuffled within the column. However, if used in isolation, anyone with knowledge of the original data can apply a "what if" scenario to the data set and piece back together a real identity. The shuffling method is also open to being reversed if the shuffling algorithm can be deciphered.
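A minimal Python sketch of shuffling a single column is shown below; the table and field names are illustrative assumptions.

    import random

    # Minimal sketch of the shuffling method: values in one column are permuted
    # among the rows, so every value still exists in the data set but is no
    # longer attached to its original record. Field names are assumptions.

    def shuffle_column(rows, column, rng=random):
        """Return a copy of rows with the given column randomly shuffled."""
        values = [row[column] for row in rows]
        rng.shuffle(values)
        return [dict(row, **{column: value}) for row, value in zip(rows, values)]

    accounts = [
        {"supplier": "Acme Pty Ltd", "balance": 125000},
        {"supplier": "Globex", "balance": 87000},
        {"supplier": "Initech", "balance": 43000},
    ]
    shuffled = shuffle_column(accounts, "balance")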
Shuffling is nevertheless a useful technique to include in an overall masking approach, as it has real strengths in certain areas. If, for instance, the end-of-year figures for financial information need to be maintained in a test database, the names of the suppliers can be masked and the account values shuffled throughout the masked database. It is highly unlikely that anyone, even someone with intimate knowledge of the original data, could trace a true data record back to its original values.
Number and Date Variance
The numeric variance method is very useful for financial and date-driven information fields. A method of masking applied in this manner can still leave a meaningful range in a financial data set such as payroll: if the variance applied is around +/- 10%, the data set remains very meaningful in terms of the ranges of salaries paid to the recipients.
The same also applies to date information. If the overall data set needs to retain demographic and actuarial data integrity, then applying a random numeric variance of +/- 120 days to date fields preserves the date distribution while still preventing traceability back to a known entity based on their known actual date of birth or the known date value of whatever record is being masked.
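A minimal Python sketch combining both forms of variance is shown below, using the +/- 10% and +/- 120 day bounds mentioned above; the record layout and values are illustrative assumptions.

    import random
    from datetime import date, timedelta

    # Minimal sketch of number and date variance: salaries are perturbed by up
    # to +/- 10% and dates of birth by up to +/- 120 days. Field names and
    # sample values are illustrative assumptions.

    def vary_number(value, pct=0.10, rng=random):
        """Apply a random variance of +/- pct to a numeric value."""
        return round(value * (1 + rng.uniform(-pct, pct)), 2)

    def vary_date(value, days=120, rng=random):
        """Shift a date by a random number of days in [-days, +days]."""
        return value + timedelta(days=rng.randint(-days, days))

    employee = {"salary": 82000.0, "date_of_birth": date(1985, 6, 14)}
    masked_employee = {
        "salary": vary_number(employee["salary"]),
        "date_of_birth": vary_date(employee["date_of_birth"]),
    }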
Encryption
Encryption is often the most complex approach to the data masking problem. The encryption algorithm often requires that a "key" be applied to view the data, based on user rights. This often sounds like the best solution, but in practice the key is then given out to personnel without the proper rights to view the data, which defeats the purpose of the masking exercise. Old databases may then be copied with the original credentials of the supplied key, and the same uncontrolled problem lives on.
Encryption algorithms can also turn the encrypted data value into a binary element that will then have issues with any validation in the application logic, where applicable, which means that full user rights need to be granted to the testers. What sometimes sounds like a great idea can be very problematic to execute. The data encryption method of masking requires extensive design and testing to ensure that the method is fit for purpose for the data type and application. Even for an application that requires a military clearance of "Secret" or "Top Secret", it would most likely be easier to ensure that all testers were properly cleared to view the test / non-production data.
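The short Python sketch below illustrates this drawback using the third-party cryptography package; the choice of Fernet and the sample field value are illustrative assumptions. The encrypted field becomes an opaque token that fails even a simple numeric format check, and anyone holding the key can recover the original value.

    from cryptography.fernet import Fernet  # third-party 'cryptography' package

    # Minimal sketch of why naive field encryption can break application
    # validation: the ciphertext is an opaque token, not a 16-digit number,
    # so any format or checksum test in the application will reject it.
    # Field value and cipher choice are illustrative assumptions.

    key = Fernet.generate_key()   # whoever holds this key can reverse the masking
    cipher = Fernet(key)

    card_number = "4111111111111111"
    ciphertext = cipher.encrypt(card_number.encode())

    print(ciphertext)                     # opaque token, no longer looks like a card number
    print(ciphertext.decode().isdigit())  # False: fails a simple numeric format check

    # Decryption restores the real value, which is exactly the governance risk:
    assert cipher.decrypt(ciphertext).decode() == card_number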
Nulling Out or Deletion
Sometimes a very simplistic approach to masking is adopted by applying a null value to a particular field. The null value approach is really only useful to prevent visibility of the data element.
In almost all cases it lessens the degree of data integrity that is maintained in the masked data set. A null is not a realistic value and will fail any application logic validation that may have been applied in the front-end software of the system under test. It also highlights to anyone wishing to reverse-engineer the identity data that data masking has been applied to some degree to the data set.
Masking Out
Character scrambling, or masking out of certain fields, is another simplistic yet very effective method of preventing sensitive information from being viewed. It is really an extension of the previous method of nulling out, but with greater emphasis on keeping the data looking real rather than fully masked altogether.
This is commonly applied to credit card data in production systems. For instance, you may have spoken with an operator in a call centre who suggested they could bill an item to your credit card. They then quote you a billing reference to your card showing only the last 6 digits, e.g. XXXX XXXX XX45 6789. As an operator they can only see the last 6 digits of your card number, but once the billing system passes your details for charging, the full number is revealed to the payment gateway systems.
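A minimal Python sketch of masking out all but the last six digits, in the spirit of the call-centre example above, is shown below; the card number is illustrative.

    # Minimal sketch of masking out: all but the last few digits of a field are
    # replaced with a mask character, preserving spaces so the value still looks
    # like a formatted card number. The sample number is illustrative.

    def mask_out(value: str, visible: int = 6, mask_char: str = "X") -> str:
        """Replace every digit except the last `visible` with mask_char."""
        kept = 0
        out = []
        for ch in reversed(value):
            if ch == " ":
                out.append(ch)
            elif kept < visible:
                out.append(ch)
                kept += 1
            else:
                out.append(mask_char)
        return "".join(reversed(out))

    print(mask_out("4111 1111 2345 6789"))  # XXXX XXXX XX45 6789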
This approach is not very useful for test systems but is very useful for the billing scenario detailed above. It is also commonly known as a dynamic data masking method.
References
1. Martin, Brendan. "What is Data Obfuscation". datakitchen.com.au. Retrieved 21 April 2013.