Data sanitization
Data sanitization involves the secure and permanent erasure of sensitive data from datasets and devices to guarantee that no residual data can be recovered, even through extensive forensic analysis. Data sanitization has a wide range of applications, but it is mainly used for clearing out old personal electronic devices or for the sharing and use of large datasets that contain sensitive information. The main strategies for erasing personal data from devices are physical destruction, cryptographic erasure, and data erasure. Data sanitization methods are also applied to the cleaning of sensitive data in datasets, for example through heuristic-based methods, machine-learning-based methods, and k-anonymity.[1]
This erasure is necessary because an increasing amount of data is moving to online storage, and data left behind on an old device poses a privacy risk if the device is resold to another individual. The importance of data sanitization has risen in recent years as private information is increasingly stored in an electronic format and larger, more complex datasets are being used to distribute private information. Electronic storage has expanded and enabled more private data to be stored, and therefore requires more advanced and thorough data sanitization techniques to ensure that no data is left on a device once it is no longer in use. Technological tools that enable the transfer of large amounts of data also allow more private data to be shared. Especially with the increasing popularity of cloud-based information sharing and storage, data sanitization methods that ensure that all shared data is cleaned have become a major concern.
Clearing devices
The main use of data sanitization is for the complete clearing of devices and destruction of all sensitive data once the device is no longer in use.[2] This is an important stage in Information Lifecycle Management (ILM), an approach for ensuring privacy and data management throughout the usage of an electronic device, as it ensures that all data is completely destroyed and unrecoverable when devices reach the end of their lifecycle.[3]
There are three main methods of data sanitization for complete erasure of data: physical destruction, cryptographic erasure, and data erasure.[3] All three erasure methods aim to ensure that deleted data cannot be accessed even through advanced forensic methods, which maintains the privacy of individuals’ data even after the mobile device is no longer in use.[3]
Physical destruction
Physical destruction involves the manual destruction of the storage media that hold the data. This method uses mechanical shredders to break devices such as phones, computers, hard drives, and printers into small pieces, or degaussers to erase magnetic media.
Degaussing is most commonly used on hard disk drives (HDDs) and involves the use of high-energy magnetic fields to permanently disrupt the functionality and memory storage of the device. When the drive is exposed to this strong magnetic field, any stored data is neutralized and cannot be recovered or used again. Degaussing is not applicable to solid-state drives (SSDs) because their data is not stored magnetically.
Physical destruction often ensures that data is completely erased and cannot be used again. However, the physical byproducts of mechanical shredding can be damaging to the environment. Furthermore, once a device is physically destroyed, it can no longer be resold or reused.
Cryptographic erasure
Cryptographic erasure involves the destruction of the secure key, or passphrase, that is used to protect stored information. Data encryption involves the generation of a secure key that allows only authorized parties to access the stored data. The permanent erasure of this key ensures that the private data stored can no longer be accessed. Cryptographic erasure is commonly implemented by the manufacturer of the device, as encryption software is often built into the device. Encryption with key erasure involves encrypting all sensitive material in a way that requires a secure key to decrypt the information when it needs to be used.[4] When the information needs to be deleted, the secure key can be erased. This provides greater ease of use than other software methods because it involves a single deletion of the secure key rather than the deletion of each individual file.
Cryptographic erasure is often used for data storage that does not contain as much private information, since errors can occur due to manufacturing failures or human error during the process of key destruction. This creates a wider range of possible outcomes of data erasure. This method allows the encrypted data to remain on the device and does not require that the device be completely overwritten. As a result, the device can be resold to another individual or company, since the physical integrity of the device itself is maintained.
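The idea of erasing a key instead of erasing every file can be illustrated with a minimal sketch. The `EncryptedStore` class, its method names, and the SHAKE-256 keystream below are assumptions made for illustration only; the cipher is a toy construction built from the Python standard library and is not a substitute for the vetted hardware or software encryption that real devices use.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (toy construction)."""
    return hashlib.shake_256(key + nonce).digest(length)


def xor_bytes(data: bytes, stream: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, stream))


class EncryptedStore:
    """Hypothetical device store: only ciphertext is ever kept in `_blocks`."""

    def __init__(self):
        # The single key that protects everything written to the store.
        self._key = bytearray(secrets.token_bytes(32))
        self._blocks = {}  # name -> (nonce, ciphertext)

    def _require_key(self):
        if self._key is None:
            raise PermissionError("key destroyed; stored data is unreadable")

    def write(self, name: str, plaintext: bytes) -> None:
        self._require_key()
        nonce = secrets.token_bytes(16)  # fresh nonce per entry
        stream = keystream(bytes(self._key), nonce, len(plaintext))
        self._blocks[name] = (nonce, xor_bytes(plaintext, stream))

    def read(self, name: str) -> bytes:
        self._require_key()
        nonce, ciphertext = self._blocks[name]
        stream = keystream(bytes(self._key), nonce, len(ciphertext))
        return xor_bytes(ciphertext, stream)

    def crypto_erase(self) -> None:
        """Destroy the key; the ciphertext stays in place but cannot be decrypted."""
        for i in range(len(self._key)):
            self._key[i] = 0  # overwrite the key material in place
        self._key = None


store = EncryptedStore()
store.write("contacts", b"alice: 555-0100")
print(store.read("contacts"))  # readable while the key exists
store.crypto_erase()           # one operation, regardless of how much data is stored
```

In this sketch the cost of `crypto_erase` does not depend on the amount of stored data, which is the ease-of-use property described above. It also shows where the method can fail: if a copy of the key survives anywhere, the ciphertext remains recoverable.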
Data erasure
The process of data erasure involves masking all information at the byte level by writing random 0s and 1s to all sectors of the electronic equipment that is no longer in use.[3] This software-based method ensures that all previously stored data is completely hidden and unrecoverable, which ensures full data sanitization. The efficacy and accuracy of this sanitization method can also be analyzed through auditable reports.[5] Data erasure can also be achieved through encryption with key erasure, in which all sensitive material is encrypted so that a secure key is required to decrypt it, and the key is erased when the information needs to be deleted.[4] This provides greater ease of use than some other software methods because it involves a single deletion of the secure key rather than the deletion of each individual file.
Data erasure often ensures complete sanitization while also maintaining the physical integrity of the electronic equipment so that the technology can be resold or reused. This ability to recycle technological devices makes data erasure a more environmentally sound version of data sanitization. This method is also the most accurate and comprehensive, since the efficacy of the data masking can be tested afterwards to ensure complete deletion. However, data erasure through software-based mechanisms requires more time than other methods.
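As a rough illustration of overwriting followed by verification, the sketch below overwrites a single file in place with random bytes and then checks that the original content is gone. The function names and parameters are assumptions for this example. Overwriting one file through the file system is also much weaker than the sector-level erasure described above: on SSDs with wear leveling, or on copy-on-write and journaling file systems, older copies of the data may survive elsewhere on the medium.

```python
import os


def overwrite_file(path: str, passes: int = 1, chunk_size: int = 1 << 20) -> None:
    """Overwrite the file's contents in place with random bytes."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(os.urandom(n))
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device


def verify_overwritten(path: str, original: bytes) -> bool:
    """Return True if the file keeps its length but no longer holds `original`."""
    with open(path, "rb") as f:
        current = f.read()
    return len(current) == len(original) and current != original
```

The time cost noted above is visible here: the work grows with the size of the data and with the number of passes, unlike the single key deletion of cryptographic erasure.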
Necessity of data sanitization
The use of mobile devices, Internet of Things (IoT) technologies, cloud-based storage systems, portable electronic devices, and various other electronic methods to store sensitive information has increased, so implementing effective erasure methods once a device is no longer in use has become crucial to protecting sensitive data.[6] Due to the increased usage of electronic devices in general and the increased storage of private information on these devices, the need for data sanitization has become much more urgent in recent years.[7]
There are also certain methods of sanitization that do not fully clean devices of private data, which can prove to be problematic. For example, some remote wiping methods on mobile devices are vulnerable to outside attacks, and their efficacy depends on the software installed on each individual device.[6] Remote wiping involves sending a wireless command to a device that has been lost or stolen, directing it to completely wipe out all data. While this method can be very beneficial, it also has several drawbacks. For example, the remote wiping method can be manipulated by attackers to trigger the process when it is not yet necessary. This can result in incomplete data sanitization. If attackers do gain access to the storage on the device, the user risks exposing all private information that was stored.
Cloud computing and storage have become an increasingly popular method of data storage and transfer. However, there are certain privacy challenges associated with cloud computing that have not been fully explored.[8] Cloud computing is vulnerable to various attacks, such as code injection, path traversal, and resource depletion, because of the shared pool structure of these new techniques. These cloud storage models require specific data sanitization methods to combat these issues. If data is not properly removed from cloud storage models, it opens up the possibility of security breaches at multiple levels.
Applications of data sanitization
Data sanitization methods are also implemented for privacy preserving data mining, association rule hiding, and blockchain-based secure information sharing. These methods involve the transfer and analysis of large datasets that contain private information. This private information needs to be sanitized before being made available online so that sensitive material is not exposed. Data sanitization is used to ensure privacy is maintained in the dataset, even when it is being analyzed.
Privacy preserving data mining
Privacy Preserving Data Mining (PPDM) is the process of data mining while maintaining privacy of sensitive material. Data mining involves analyzing large datasets to gain new information and draw conclusions. PPDM has a wide range of uses and is an integral step in the transfer or use of any large data set containing sensitive material.
Data sanitization is an integral step to privacy preserving data mining because private datasets need to be sanitized before they can be utilized by individuals or companies for analysis. The aim of privacy preserving data mining is to ensure that private information cannot be leaked or accessed by attackers and sensitive data is not traceable to individuals that have submitted the data.[citation needed] Privacy preserving data mining aims to maintain this level of privacy for individuals while also maintaining the integrity and functionality of the original dataset.[9] In order for the dataset to be used, necessary aspects of the original data need to be protected during the process of data sanitization. This balance between privacy and utility has been the primary goal of data sanitization methods.[9]
One approach to achieving this balance of privacy and utility is to encrypt and decrypt sensitive information using a process called key generation.[9] After the data is sanitized, key generation is used to ensure that this data is secure and cannot be tampered with. Approaches such as the Rider Optimization Algorithm (ROA), also called Randomized ROA (RROA), use these key generation strategies to find the optimal key so that data can be transferred without leaking sensitive information.[9]
Some versions of key generation have also been optimized to fit larger datasets. For example, a novel, method-based Privacy Preserving Distributed Data Mining strategy is able to increase privacy and hide sensitive material through key generation. This version of sanitization allows large amounts of material to be sanitized. For companies that are seeking to share information with several different groups, this methodology may be preferred over original methods that take much longer to process.[10]
Certain models of data sanitization delete or add information to the original database in an effort to preserve the privacy of each subject. These heuristic-based algorithms are becoming more popular, especially in the field of association rule mining. Heuristic methods involve specific algorithms that use pattern hiding, rule hiding, and sequence hiding to keep specific information hidden. This type of data hiding can be used to cover wide patterns in data, but is not as effective for protecting specific information. Heuristic-based methods are not as well suited to sanitizing large datasets; however, recent developments in the field have explored ways to tackle this problem. One example is the MR-OVnTSA approach, a heuristics-based sensitive pattern hiding approach for big data introduced by Shivani Sharma and Durga Toshniwal.[7] The name stands for "MapReduce Based Optimum Victim Item and Transaction Selection Approach", and the method aims to reduce the loss of important data while removing and hiding sensitive information. It takes advantage of algorithms that compare steps and optimize sanitization.[7]
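The general shape of heuristic pattern hiding can be sketched as follows: a sensitive itemset is hidden by deleting one "victim" item from enough supporting transactions that the itemset's support falls below the mining threshold. This is a generic illustration, not the MR-OVnTSA algorithm; the function names and the two selection rules (victim item with the highest overall frequency, shortest supporting transactions first) are assumptions chosen for simplicity.

```python
def support(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def hide_itemset(transactions, sensitive, min_support):
    """Return a sanitized copy in which support(sensitive) < min_support."""
    sanitized = [set(t) for t in transactions]  # leave the input untouched
    supporting = [t for t in sanitized if sensitive <= t]

    # Assumed heuristic: remove the item that is most frequent overall,
    # so that its own frequency is disturbed proportionally the least.
    victim = max(sorted(sensitive),
                 key=lambda item: sum(item in t for t in sanitized))

    # Assumed heuristic: edit short transactions first, since they
    # support fewer other (non-sensitive) patterns.
    supporting.sort(key=len)

    # Support must drop to at most min_support - 1.
    excess = len(supporting) - (min_support - 1)
    for t in supporting[:max(excess, 0)]:
        t.discard(victim)
    return sanitized


baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "beer"},
    {"bread"},
    {"milk", "eggs"},
]
cleaned = hide_itemset(baskets, {"bread", "milk"}, min_support=2)
print(support(baskets, {"bread", "milk"}), support(cleaned, {"bread", "milk"}))
```

Every deleted item is also a loss of non-sensitive information, which is why published approaches concentrate on choosing the victim items and transactions carefully.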
An important goal of PPDM is to strike a balance between maintaining the privacy of users that have submitted the data while also enabling developers to make full use of the dataset. Many measures of PPDM directly modify the dataset and create a new version that makes the original unrecoverable. It strictly erases any sensitive information and makes it inaccessible for attackers.
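One way to see how a dataset can be modified while remaining usable is k-anonymity, mentioned in the introduction: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below generalizes two quasi-identifiers and checks the property. The field names, the bucket width, and the sample records are made up for illustration.

```python
from collections import Counter


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def is_k_anonymous(records, quasi_identifiers, k: int) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical records; 'diagnosis' is the sensitive attribute.
raw = [
    {"age": 34, "zip": "94110", "diagnosis": "flu"},
    {"age": 36, "zip": "94117", "diagnosis": "asthma"},
    {"age": 52, "zip": "10027", "diagnosis": "flu"},
    {"age": 58, "zip": "10025", "diagnosis": "diabetes"},
]
released = [
    {"age": generalize_age(r["age"]),
     "zip": generalize_zip(r["zip"]),
     "diagnosis": r["diagnosis"]}
    for r in raw
]
print(is_k_anonymous(raw, ["age", "zip"], k=2))
print(is_k_anonymous(released, ["age", "zip"], k=2))
```

The released table still supports analysis by age band and region, while no single record can be singled out by its age and ZIP code alone. This is the privacy and utility trade-off described above in miniature.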
Association rule mining
One type of data sanitization is rule-based PPDM, which uses defined computer algorithms to clean datasets. Association rule hiding is the process of data sanitization as applied to transactional databases.[11] Transactional database is the general term for data storage used to record transactions as organizations conduct their business. Examples include shipping payments, credit card payments, and sales orders. The cited review analyzes fifty-four different methods of data sanitization and presents four major findings about their trends.
Certain newer methods of data sanitization rely on machine learning and deep learning. There are various weaknesses in the current use of data sanitization. Many methods are not intricate or detailed enough to protect against more specific data attacks.[12] This effort to maintain privacy while mining important data is referred to as privacy-preserving data mining. Machine learning develops methods that are more adapted to different types of attacks and can learn to face a broader range of situations. Deep learning is able to simplify data sanitization methods and run these protective measures in a more efficient and less time-consuming way.
There have also been hybrid models that use both rule-based and deep learning methods to achieve a balance between the two techniques.
Blockchain-based secure information sharing
Browser backed cloud storage systems are heavily reliant on data sanitization and are becoming an increasingly popular route of data storage.[13] Furthermore, the ease of usage is important for enterprises and workplaces that use cloud storage for communication and collaboration.[8]
Blockchain is used to record and transfer information in a secure way, and data sanitization techniques are required to ensure that this data is transferred more securely and accurately. It is especially applicable to supply chain management and may be useful for those looking to optimize the supply chain process.[14] For example, the Whale Optimization Algorithm (WOA) uses a method of secure key generation to ensure that information is shared securely through the blockchain technique.[14] The need to improve blockchain methods is becoming increasingly relevant as the global level of development increases and becomes more electronically dependent.
Industry specific applications
Healthcare
The healthcare industry is an important sector that relies heavily on data mining and use of datasets to store confidential information about patients. The use of electronic storage has also been increasing in recent years, which requires more comprehensive research and understanding of the risks that it may pose. Currently, data mining and storage techniques are only able to store limited amounts of information. This reduces the efficacy of data storage and increases the costs of storing data. New advanced methods of storing and mining data that involve cloud based systems are becoming increasingly popular as they are able to both mine and store larger amounts of information.
Risks posed by inadequate sanitization
Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If data sanitization methods fail to remove all sensitive information, there is a risk that this information will leak to attackers.[8] Numerous studies have been conducted to optimize ways of preserving sensitive information. Some methods of data sanitization are highly sensitive to isolated points that have no other data points close to them. This type of data sanitization is very precise and can detect anomalies even if the poisoned data point is relatively close to true data.[15] Another method of data sanitization also removes outliers, but does so in a more general way. It detects the general trend of the data and discards any data that strays from it, and it is able to target anomalies even when they are inserted as a group.[15] In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.
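The second, more general style of outlier removal can be sketched as a distance-to-centroid filter: estimate where the bulk of the data lies and discard the points farthest from it. This is a simplified illustration and not the specific defenses evaluated in the cited work; the function names and the `keep_fraction` parameter are assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def sanitize_by_distance(points, keep_fraction=0.9):
    """Keep the `keep_fraction` of points closest to the centroid.

    The result is ordered by distance from the centroid, nearest first.
    """
    center = centroid(points)
    ranked = sorted(points, key=lambda p: math.dist(p, center))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


clean = [(0.1, 0.2), (0.0, -0.1), (0.3, 0.1), (-0.2, 0.0), (0.1, -0.3),
         (-0.1, 0.2), (0.2, 0.2), (0.0, 0.1), (-0.3, -0.1)]
poisoned = clean + [(100.0, 100.0)]  # one injected point far from the rest
print((100.0, 100.0) in sanitize_by_distance(poisoned))
```

A filter of this kind has two visible weaknesses: the centroid itself is pulled toward the injected points, and a fixed `keep_fraction` discards legitimate data when nothing has been poisoned.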
Furthermore, data sanitization methods may remove useful, non-sensitive information, which renders the sanitized dataset less useful and different from the original. There have been iterations of common data sanitization techniques that attempt to correct this loss of integrity in the original dataset. In particular, Liu, Chen, Wen, and Song proposed a new data sanitization algorithm called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF).[16] Much of the emphasis in the field is placed on protecting the privacy of users, so this method brings a new perspective that also focuses on protecting the integrity of the data. It has three main advantages: it optimizes the sanitization process by cleaning only the item with the highest conflict count, it keeps the parts of the dataset with the highest utility, and it analyzes the conflict degree of the sensitive material. Research was conducted on the efficacy and usefulness of this technique to show how it can help maintain the integrity of the dataset. The technique first pinpoints the specific parts of the dataset that may be poisoned data and then uses algorithms to weigh how useful each part is before deciding whether it should be removed.[16] This approach to data sanitization takes the utility of the data into account before the data is discarded.
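The "highest conflict count" idea can be illustrated with a small sketch. It assumes that the conflict count of an item is the number of sensitive itemsets, among those supported by a transaction, that contain the item; this definition and the function names are assumptions for illustration, and the sketch does not reproduce IMSICF itself or its utility calculations.

```python
from collections import Counter


def conflict_counts(sensitive_itemsets):
    """Count, for each item, how many of the given itemsets contain it."""
    counts = Counter()
    for itemset in sensitive_itemsets:
        counts.update(itemset)
    return counts


def pick_victim(transaction, sensitive_itemsets):
    """Choose the item whose removal breaks the most sensitive itemsets.

    Returns None if the transaction supports no sensitive itemset.
    """
    supported = [s for s in sensitive_itemsets if s <= transaction]
    if not supported:
        return None
    counts = conflict_counts(supported)
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(counts), key=counts.get)


sensitive = [{"a", "b"}, {"b", "c"}, {"c", "e"}]
print(pick_victim({"a", "b", "c", "d"}, sensitive))
```

Here the transaction supports two sensitive itemsets, and removing the single shared item hides both of them. Deleting one item instead of two is how this style of selection limits damage to the non-sensitive content of the dataset.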
References
1. "K-anonymity: An Introduction". Privitar. 2017-04-07. Retrieved 2021-06-12.
2. "Data Sanitization | University IT". uit.stanford.edu. Retrieved 2021-04-30.
3. "Data Sanitization Terminology and Definitions". International Data Sanitization Consortium. Retrieved 2021-04-30.
4. Diesburg, Sarah M.; Wang, An-I Andy (2010-12-03). "A survey of confidential data storage and deletion methods" (PDF). ACM Computing Surveys. 43 (1): 2:1–2:37. CiteSeerX 10.1.1.188.3969. doi:10.1145/1824795.1824797. S2CID 3336775.
5. "What is Data Sanitization? | Data Erasure Methods | Imperva". Learning Center. Retrieved 2021-04-30.
6. Leom, Ming Di; Choo, Kim-Kwang Raymond; Hunt, Ray (2016). "Remote Wiping and Secure Deletion on Mobile Devices: A Review". Journal of Forensic Sciences. 61 (6): 1473–1492. doi:10.1111/1556-4029.13203. PMID 27651127. S2CID 20563918.
7. Sharma, Shivani; Toshniwal, Durga (2020-12-01). "MR-OVnTSA: a heuristics based sensitive pattern hiding approach for big data". Applied Intelligence. 50 (12): 4241–4260. doi:10.1007/s10489-020-01749-6. S2CID 220542429.
8. Tabrizchi, Hamed; Kuchaki Rafsanjani, Marjan (2020-12-01). "A survey on security challenges in cloud computing: issues, threats, and solutions". The Journal of Supercomputing. 76 (12): 9493–9532. doi:10.1007/s11227-020-03213-1. S2CID 211539375.
9. Shivashankar, Mohana; Mary, Sahaaya Arul (2021). "Privacy preservation of data using modified rider optimization algorithm: Optimal data sanitization and restoration model". Expert Systems. 38 (3): e12663. doi:10.1111/exsy.12663.
10. Lekshmy, P. L.; Rahiman, M. Abdul (2020-07-01). "A sanitization approach for privacy preserving data mining on social distributed environment". Journal of Ambient Intelligence and Humanized Computing. 11 (7): 2761–2777. doi:10.1007/s12652-019-01335-w. S2CID 198324918.
11. Telikani, Akbar; Shahbahrami, Asadollah (2018). "Data sanitization in association rule mining: An analytical review". Expert Systems with Applications. 96: 406–426. doi:10.1016/j.eswa.2017.10.048.
12. Ahmed, Usman; Srivastava, Gautam; Lin, Jerry Chun-Wei (2021). "A Machine Learning Model for Data Sanitization". Computer Networks. 189: 107914. doi:10.1016/j.comnet.2021.107914.
13. Balashunmugaraja, B.; Ganeshbabu, T. R. (2020-05-30). "Optimal Key Generation for Data Sanitization and Restoration of Cloud Data: Future of Financial Cyber Security". International Journal of Information Technology & Decision Making. 19 (4): 987–1013. doi:10.1142/S0219622020500200.
14. Abidi, Mustufa Haider; Alkhalefah, Hisham; Umer, Usama; Mohammed, Muneer Khan (2020). "Blockchain‐based secure information sharing for supply chain management: Optimization assisted data sanitization process". International Journal of Intelligent Systems. 36 (1): 260–290. doi:10.1002/int.22299. S2CID 227249700.
15. Koh, Pang Wei; Steinhardt, Jacob; Liang, Percy (2018-11-01). "Stronger Data Poisoning Attacks Break Data Sanitization Defenses". arXiv:1811.00741 [stat.ML].
16. Liu, Xuan; Chen, Genlang; Wen, Shiting; Song, Guanghui (2020-05-31). "An Improved Sanitization Algorithm in Privacy-Preserving Utility Mining". Mathematical Problems in Engineering. 2020: 1–14. doi:10.1155/2020/7489045.