Fuzzy hashing
This article, Fuzzy hashing, has recently been created via the Articles for creation process. Please check to see if the reviewer has accidentally left this template after accepting the draft and take appropriate action as necessary.
Reviewer tools: Inform author |
Fuzzy hashing, also known as similarity hashing[1], is a technique for detecting data that is similar, but not exactly the same, as other data. This is in contrast to cryptographic hash functions, which are designed to have significantly different hashes for even minor differences. Fuzzy hashing has been used to identify malware and has potential for other applications, like data loss prevention [2].
Background
A hash function is a mathematical algorithm which maps arbitrary-sized data to a fixed size output[3]. Many solutions use cryptographic hash functions like SHA-256 to detect duplicates or check for known files within large collection of files[4]. However, cryptographic hash functions cannot be used for determining if a file is similar to a known file, because one of the requirements of a cryptographic hash function is that a small change to the input should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect)[5].
Fuzzy hashing exists to solve this problem of detecting data that is similar, but not exactly the same, as other data. Fuzzy hashing algorithms specifically use algorithms in which two similar inputs will generate two similar hash values. This property is the exact opposite of the avalanche effect desired in cryptographic hash functions.
Fuzzy hashing can also be used to detect when one object is contained within another.[1]
Approaches for fuzzy hashing
There are a few approaches used for building fuzzy hash algorithms:[6]
- Context Triggered Piecewise Hashing (CTPH), which constructs a hash by splitting the input into multiple pieces, calculating traditional hashes for each piece, and then combining those traditional hashes into a single string.
- Locality Sensitive Hashing places similar input items into the same "buckets", which can be used for data clustering and nearest neighbor searches
Notable tools
- spamsum is a tool written by Andrew Tridgell that uses fuzzy hashing to determine whether an email is similar to known spam. It operates by generating a fuzzy hash for an email that it compares against the fuzzy hashes from known spam emails to generate a match result between 0 (complete mismatch) to 100 (perfect match). If the match result is high enough, the email is classified as spam.[7] [8]
- ssdeep is a program that uses fuzzy hashing to compare binary files
ssdeep Concept of Operation
ssdeep operates by splitting the file into separate pieces, and calculating a hash for each piece. Those piecewise hashes are then combined to create a single hash value that can be compared to other hash values to compute the distance between them. The lower the distance, the more similar the files are to each other.
Reread:
- https://web.archive.org/web/20160517070757/http://dfrws.org/2006/proceedings/12-Kornblum.pdf
- https://medium.com/pythonic-forensics/fuzzy-hashing-and-ctph-69f4a7bbfe48
- https://www.yumpu.com/en/document/read/51148786/fuzzy-hashing-jesse-kornblum
- https://en.wikipedia.org/wiki/Locality-sensitive_hashing
- https://github.com/trendmicro/tlsh
- https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.6695&rep=rep1&type=pdf
- https://ssdeep-project.github.io/ssdeep/index.html
- US Patent: https://patents.google.com/patent/US20110093426A1/en from [Greg Hoglund]
- Also referred as a "Fuzzy checksum" by https://en.wikipedia.org/wiki/Checksum#Fuzzy_checksum / https://cwiki.apache.org/confluence/display/spamassassin/iXhash
- http://www.csroc.org.tw/journal/JOC30_2/JOC3002-13.pdf
Applications
Fuzzy Hashing has applications for the following scenarios:
- Altered document matching - https://www.sciencedirect.com/science/article/pii/S1742287606000764?via%3Dihub
- Partial file matching - https://www.sciencedirect.com/science/article/pii/S1742287606000764?via%3Dihub
- Malware similarity - https://www.blackhat.com/presentations/bh-usa-09/RAYGOZA/BHUSA09-Raygoza-MalwareSimAnalysis-PAPER.pdf
See also
References
- ^ a b "NIST Special Publication 800-168" (PDF). NIST.gov. Retrieved June 30, 2022.
- ^ Kornblum, Jesse (2006). "Identifying almost identical files using context triggered piecewise hashing". Digital Investigation. 3, Supplement (September 2006): 91–97. doi:10.1016/j.diin.2006.06.015. Retrieved June 30, 2022.
- ^ Schueffel, Patrick; Groeneweg, Nikolaj; Baldegger, Rico (2019). The Crypto Encyclopdia - Coins, tokens and digital assets from A to Z. Bern, Fribourg: Growth Publisher / HEG Fribourg. p. 27. ISBN 978-2-940384-47-1.
- ^ Kornblum, Jesse (2006). "Identifying almost identical files using context triggered piecewise hashing". Digital Investigation. 3, Supplement (September 2006): 91–97. doi:10.1016/j.diin.2006.06.015. Retrieved June 30, 2022.
- ^ Al-Kuwari, Saif; Davenport, James H.; Bradford, Russell J. (2011). "Cryptographic Hash Functions: Recent Design Trends and Security Notions". Cryptology ePrint Archive. Report 2011/565.
- ^ Jonathan Oliver; et al. (2021). "Designing the Elements of a Fuzzy Hashing Scheme" (PDF). Archived (PDF) from the original on 14 April 2021. Retrieved 14 April 2021.
- ^ "spamsum README". samba.org. Retrieved December 11, 2022.
- ^ "spamsum.c". samba.org. Retrieved December 11, 2022.