
Data sanitization

Data sanitization is the process of permanently and irreversibly removing or hiding sensitive information from datasets that are used for study or transferred from one device to another. The technique makes it possible to extract useful information from a database without infringing on the private information it may store. In recent decades, database information has been used increasingly in electronic tools such as 5G mobile data, and there has been growing use of Internet of Things (IoT) technologies: smart devices equipped with sensors, cameras, recording devices, and other sensory tools that are linked directly to other devices through the internet. These devices can transfer information from one device to another with ease, but that ease of transfer also poses a major privacy challenge, because sensitive, raw data must be removed along the way. For example, devices such as Alexa and Google Home need data sanitization tools that prevent leakage of the private data they collect.

Data sanitization is also commonly referred to as privacy-preserving data mining (PPDM), as it aims to preserve important information while using algorithms to filter out sensitive details. Many current models of data sanitization rely on heuristic methods that delete or add information to the original database in an effort to preserve the privacy of each subject, but there have also been numerous newer developments of PPDM that rely instead on machine learning and deep learning techniques.

Applications of Data Sanitization

Privacy Preserving Data Mining (PPDM) has a wide range of uses and is an integral step in the transfer or use of any large dataset. It is also commonly linked to blockchain-based secure information sharing within supply chain management systems. Application areas include:

  • 5G mobile data
  • Internet of Things (IoT) technologies, e.g. Alexa, Google Home
  • The healthcare industry, which uses large datasets of sensitive information
  • The supply chain industry, which uses blockchain and optimal key generation

Browser-backed cloud storage systems rely heavily on data sanitization and are becoming an increasingly popular route of data storage. Ease of use is also important for enterprises and workplaces that use cloud storage for communication and collaboration.

Data sanitization is especially relevant for the medical field and for large public organizations that must work with very large databases of sensitive data; such organizations need efficient ways to hide sensitive data while maintaining the functionality of the dataset.

Blockchain is used to record and transfer information in a secure way, and data sanitization techniques are required to ensure that this data is transferred securely and accurately. They are especially applicable in supply chain management and may be useful to those looking to optimize the supply chain process. The need to improve blockchain methods is becoming increasingly relevant as the world develops and becomes more electronically dependent.

Associated Risks

Inadequate data sanitization methods can result in two main problems: a breach of private information and compromises to the integrity of the original dataset. If a sanitization method fails to remove all sensitive information, that information is at risk of leaking to attackers. Numerous studies have been conducted to optimize the preservation of sensitive information. Some data sanitization methods are highly sensitive to distinct points that lie far from the rest of the data; this type of sanitization is very precise and can detect an anomaly even when the poisoned data point is relatively close to the true data. Other methods remove outliers more generally: they detect the general trend of the data and discard anything that strays from it, which allows them to target anomalies even when they are inserted as a group. In general, data sanitization techniques use algorithms to detect anomalies and remove any suspicious points that may be poisoned data or sensitive information.
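
Neither approach is tied to a named algorithm here; as a minimal sketch of the general idea, assuming numeric data in a NumPy array, a simple z-score filter discards points that lie far from the rest of the data:

    import numpy as np

    def zscore_sanitize(data, threshold=3.0):
        """Discard rows lying more than `threshold` standard deviations
        from the column mean (a simple anomaly filter)."""
        mean = data.mean(axis=0)
        std = data.std(axis=0) + 1e-12          # avoid division by zero
        scores = np.abs((data - mean) / std)    # per-column z-scores
        return data[(scores < threshold).all(axis=1)]

    # One injected point far from the true data is detected and dropped.
    true_data = np.random.default_rng(0).normal(0, 1, size=(100, 2))
    poisoned = np.vstack([true_data, [[50.0, 50.0]]])
    print(poisoned.shape, "->", zscore_sanitize(poisoned).shape)  # (101, 2) -> (100, 2)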

Furthermore, data sanitization methods may remove useful, non-sensitive information, rendering the sanitized dataset less useful and altered from the original. There have been iterations of common data sanitization techniques that attempt to correct this loss of dataset integrity. In particular, Liu, Xuan, Wen, and Song offered a new data sanitization algorithm called the Improved Minimum Sensitive Itemsets Conflict First Algorithm (IMSICF). Much emphasis is usually placed on protecting the privacy of users, so this method brings a new perspective that focuses on protecting the integrity of the data as well. It has three main advantages: it optimizes the sanitization process by cleaning only the item with the highest conflict count, it keeps the parts of the dataset with the highest utility, and it analyzes the conflict degree of the sensitive material. Research on the efficacy of this technique showed how it can benefit the integrity of the dataset: it pinpoints the specific parts of the dataset that may be poisoned and uses computed tradeoffs between a datum's utility and the need to remove it before deciding whether it should be discarded. This is a new form of data sanitization that takes the utility of the data into account before the data is immediately discarded.
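
The published algorithm involves more detail than is described here; the following sketch illustrates only its central idea: the victim item chosen for deletion is the one shared by the greatest number of still-exposed sensitive itemsets (the highest conflict count), so each deletion hides as much sensitive material as possible while disturbing little else. The function names are illustrative, not taken from the paper:

    def support(transactions, itemset):
        """Number of transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions)

    def hide_sensitive_itemsets(transactions, sensitive, min_support):
        """Conflict-count-first sanitization sketch (inspired by IMSICF)."""
        transactions = [set(t) for t in transactions]
        while True:
            exposed = [s for s in sensitive
                       if support(transactions, s) >= min_support]
            if not exposed:
                return transactions
            # Conflict count: how many exposed sensitive itemsets contain each item.
            counts = {}
            for s in exposed:
                for item in s:
                    counts[item] = counts.get(item, 0) + 1
            victim = max(counts, key=counts.get)
            # Delete the victim from one transaction that supports an
            # exposed sensitive itemset containing it.
            for t in transactions:
                if any(victim in s and s <= t for s in exposed):
                    t.discard(victim)
                    break

    db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
    print(hide_sensitive_itemsets(db, [{"a", "b"}], min_support=2))
    # The support of {"a", "b"} now falls below 2.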

Methods of Data Sanitization

An important goal of PPDM is to strike a balance between maintaining the privacy of the users who have submitted the data and enabling developers to make full use of the dataset. Many PPDM measures directly modify the dataset, creating a new version from which the original cannot be recovered. These measures strictly erase any sensitive information and make it inaccessible to attackers.
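
No particular transformation is prescribed here; as one hypothetical illustration of modifying a dataset so that the original is unrecoverable, a sanitizer can suppress direct identifiers and generalize quasi-identifiers (the field names below are invented for the example):

    def sanitize_record(record):
        """Return an irreversibly modified copy of a record: direct
        identifiers suppressed, quasi-identifiers generalized."""
        clean = dict(record)
        clean["name"] = None                       # suppression: value erased
        clean["age"] = (record["age"] // 10) * 10  # generalization: 37 -> 30s band
        clean["zip"] = record["zip"][:3] + "**"    # generalization: truncated ZIP
        return clean

    original = {"name": "Alice", "age": 37, "zip": "90210", "diagnosis": "flu"}
    print(sanitize_record(original))
    # {'name': None, 'age': 30, 'zip': '902**', 'diagnosis': 'flu'}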

One type of data sanitization is rule-based PPDM, which uses defined computer algorithms to clean datasets. Association rule hiding is data sanitization as applied to transactional databases. Transactional databases are the general term for data storage used to record transactions as organizations conduct their business; examples include shipping payments, credit card payments, and sales orders. One survey analyzed fifty-four different methods of data sanitization and presented four major findings about their trends.
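
As a hedged sketch of association rule hiding, not of any specific published algorithm, a sensitive rule such as {bread} → {butter} can be hidden by deleting a consequent item from supporting transactions until the rule's support falls below the mining threshold:

    def rule_support(transactions, antecedent, consequent):
        """Fraction of transactions containing both sides of the rule."""
        both = antecedent | consequent
        return sum(both <= t for t in transactions) / len(transactions)

    def hide_rule(transactions, antecedent, consequent, min_support):
        """Remove one consequent item from supporting transactions until
        the sensitive rule's support drops below `min_support`."""
        transactions = [set(t) for t in transactions]
        victim = next(iter(consequent))
        for t in transactions:
            if rule_support(transactions, antecedent, consequent) < min_support:
                break
            if (antecedent | consequent) <= t:
                t.discard(victim)
        return transactions

    # Hypothetical transactional database; hide the rule {bread} -> {butter}.
    db = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
    print(hide_rule(db, {"bread"}, {"butter"}, min_support=0.3))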

Certain newer methods of data sanitization rely on machine learning and deep learning. The current practice of data sanitization has various weaknesses: many methods are not intricate or detailed enough to protect against more specific data attacks. This effort to maintain privacy while mining important data is what is referred to as privacy-preserving data mining. Machine learning develops methods that are better adapted to different types of attacks and can learn to face a broader range of situations. Deep learning is able to simplify data sanitization methods and to run these protective measures in a more efficient and less time-consuming way.
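
No specific model is cited here; as one hypothetical example of a learned filter, assuming scikit-learn is available, an isolation forest can flag suspect records without hand-written rules, including anomalies inserted as a group:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Hypothetical numeric dataset with a small cluster of injected points.
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, size=(200, 2)),
                      rng.normal(8, 0.1, size=(5, 2))])   # grouped anomalies

    # The forest learns the shape of the data rather than a fixed rule,
    # so it can flag anomalies even when they arrive as a group.
    model = IsolationForest(contamination=0.03, random_state=0)
    labels = model.fit_predict(data)          # 1 = inlier, -1 = suspect
    sanitized = data[labels == 1]
    print(data.shape, "->", sanitized.shape)  # 205 rows -> roughly 199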

There have also been hybrid models that utilize both rule-based and machine learning or deep learning methods to achieve a balance between the two techniques.